Shift-reduce parser

Last updated

A shift-reduce parser is a class of efficient, table-driven bottom-up parsing methods for computer languages and other notations formally defined by a grammar. The parsing methods most commonly used for parsing programming languages, LR parsing and its variations, are shift-reduce methods. [1] The precedence parsers used before the invention of LR parsing are also shift-reduce methods. All shift-reduce parsers have similar outward effects, in the incremental order in which they build a parse tree or call specific output actions.

Contents

Overview

A shift-reduce parser scans and parses the input text in one forward pass over the text, without backing up. The parser builds up the parse tree incrementally, bottom up, and left to right, without guessing or backtracking. At every point in this pass, the parser has accumulated a list of subtrees or phrases of the input text that have been already parsed. Those subtrees are not yet joined together because the parser has not yet reached the right end of the syntax pattern that will combine them.

Shift-reduce parse tree built bottom-up in numbered steps. Shift-Reduce Parse Steps Numbered.svg
Shift-reduce parse tree built bottom-up in numbered steps.

Consider the string A = B + C * 2.

At step 7 in the example, only "A = B +" has been parsed. Only the shaded lower-left corner of the parse tree exists. None of the parse tree nodes numbered 8 and above exist yet. Nodes 1, 2, 6, and 7 are the roots of isolated subtrees covering all the items 1..7. Node 1 is variable A, node 2 is the delimiter =, node 6 is the summand B, and node 7 is the operator +. These four root nodes are temporarily held in a parse stack. The remaining unparsed portion of the input stream is "C * 2".

A shift-reduce parser works by doing some combination of Shift steps and Reduce steps, hence the name.

The parser continues with these steps until all of the input has been consumed and all of the parse trees have been reduced to a single tree representing an entire legal input.

Tree building steps

At every parse step, the entire input text is divided into parse stack, current lookahead symbol, and remaining unscanned text. The parser's next action is determined by the rightmost stack symbol(s) and the lookahead symbol. The action is read from a table containing all syntactically valid combinations of stack and lookahead symbols.

StepParse StackLook
Ahead
UnscannedParser Action
0emptyid= B + C*2Shift
1id=B + C*2Shift
2id =id+ C*2Shift
3id = id+C*2Reduce by Value ← id
4id = Value+C*2Reduce by Products ← Value
5id = Products+C*2Reduce by Sums ← Products
6id = Sums+C*2Shift
7id = Sums +id*2Shift
8id = Sums + id*2Reduce by Value ← id
9id = Sums + Value*2Reduce by Products ← Value
10id = Sums + Products*2Shift
11id = Sums + Products *inteofShift
12id = Sums + Products * inteofReduce by Value ← int
13id = Sums + Products * ValueeofReduce by Products ← Products * Value
14id = Sums + ProductseofReduce by Sums ← Sums + Products
15id = SumseofReduce by Assign ← id = Sums
16AssigneofDone

See [2] for a simpler example.

Grammars

A grammar is the set of patterns or syntax rules for the input language. It doesn't cover all language rules, such as the size of numbers, or the consistent use of names and their definitions in the context of the whole program. Shift-reduce parsers use a context-free grammar that deals just with local patterns of symbols.

An example grammar as a tiny subset of the Java or C language capable of matching A = B + C*2 might be:

Assign ← id = Sums
Sums ← Sums + Products
Sums ← Products
Products ← Products * Value
Products ← Value
Value ← int
Value ← id

The grammar's terminal symbols are the multi-character symbols or 'tokens' found in the input stream by a lexical scanner. Here these include =+* and int for any integer constant, and id for any identifier name. The grammar doesn't care what the int values or id spellings are, nor does it care about blanks or line breaks. The grammar uses these terminal symbols but does not define them. They are always at the bottom bushy end of the parse tree.

The capitalized terms like Sums are nonterminal symbols. These are names for concepts or patterns in the language. They are defined in the grammar and never occur themselves in the input stream. They are always above the bottom of the parse tree. They only happen as a result of the parser applying some grammar rule. Some nonterminals are defined with two or more rules; these are alternative patterns. Rules can refer back to themselves. This grammar uses recursive rules to handle repeated math operators. Grammars for complete languages use recursive rules to handle lists, parenthesized expressions and nested statements.

Any given computer language can be described by several different grammars. The grammar for a shift-reduce parser must be unambiguous itself, or be augmented by tie-breaking precedence rules. This means there is only one correct way to apply the grammar to a given legal example of the language, resulting in a unique parse tree and a unique sequence of shift/reduce actions for that example.

A table-driven parser has all of its knowledge about the grammar encoded into unchanging data called parser tables. The parser's program code is a simple generic loop that applies unchanged to many grammars and languages. The tables may be worked out by hand for precedence methods. For LR methods, the complex tables are mechanically derived from a grammar by some parser generator tool like Bison. [3] The parser tables are usually much larger than the grammar. In other parsers that are not table-driven, such as recursive descent, each language construct is parsed by a different subroutine, specialized to the syntax of that one construct.

Parser actions

The shift-reduce parser is efficient because it involves no backing up. Its total execution time scales linearly with the length of the input and the size of the complete parse tree. Other parser methods that backtrack may take exponential time when they guess badly.[ citation needed ]

To avoid guessing, the shift-reduce parser often looks ahead (to the right in left-to-right text) at the next scanned symbol before deciding what to do with previously scanned symbols. The lexical scanner works one symbol ahead of the rest of the parser. The lookahead symbol is also called the 'right-hand context' for each parsing decision. (Rarely, two or more lookahead symbols may be used, although most practical grammars can be designed to use one lookahead symbol.)

A shift-reduce parser waits until it has scanned and parsed all parts of some construct before committing to what the combined construct is. The parser then acts immediately on the combination instead of waiting any further. In the parse tree example above, the phrase B gets reduced to Value and then to Products and Sums in steps 3-6 as soon as + is seen in lookahead, rather than waiting any longer to organize those parts of the parse tree. The decisions how to handle B are based only on what the parser and scanner have already seen, without considering things that appear much later to the right.

Reductions reorganize the most recently parsed things, that is, those immediately to the left of the lookahead symbol. So the list of already-parsed things acts like a stack. This parse stack grows rightwards. The base or bottom of the stack is on the left and holds the leftmost, oldest parse fragment. Every reduction step acts only on the rightmost, newest parse fragments. (This accumulative parse stack is very unlike the predictive, leftward-growing parse stack used by top-down parsers.)

When a grammar rule such as

Products ← Products * Value

is applied, the stack top holds the parse trees "... Products * Value". This found instance of the rule's right hand side is called the handle. The reduce step replaces the handle "Products * Value" by the left hand side non terminal, in this case a larger Products. If the parser builds complete parse trees, then the three trees for inner Products, *, and Value are combined by a new tree root for the larger Products. Otherwise, semantic details from the inner Products and Value are output to some later compiler pass, or are combined and saved in the new Products symbol. [4]

The parser keeps applying reductions to the top of the parse stack for as long as it keeps finding newly completed examples of grammar rules there. When no more rules can be applied, the parser then shifts the lookahead symbol onto the parse stack, scans a new lookahead symbol, and tries again.

Types of shift-reduce parsers

The parser tables show what to do next, for every legal combination of topmost parse stack symbols and lookahead symbol. That next action must be unique; either shift, or reduce, but not both. (This implies some further limitations on the grammar, beyond being unambiguous.) The table details vary greatly between different types of shift-reduce parsers.

In precedence parsers, the right end of handles are found by comparing the precedence level or grammar tightness of the top stack symbols to that of the lookahead symbol. In the example above, int and id belong to inner grammar levels compared to the statement delimiter ;. So int and id are both considered to be higher precedence than ; and should be reduced to something else whenever followed by ;. There are different varieties of precedence parsers, each with different ways of finding the handle's left end and choosing the correct rule to apply:

Precedence parsers are limited in the grammars they can handle. They ignore most of the parse stack when making decisions. They consider only the names of the topmost symbols, not the full context of where in the grammar those symbols are now appearing. Precedence requires that similar-looking symbol combinations must be parsed and used in identical ways throughout the grammar however those combinations happen regardless of context.

LR parsers are a more flexible form of shift-reduce parsing, handling many more grammars. [8]

LR parser processing

LR parsers function like a state machine, performing a state transition for each shift or reduce action. These employ a stack where the current state is pushed (down) by shift actions. This stack is subsequently popped (up) by reduce actions (and which concurrently stacks a new state). This mechanism allows the LR parser to handle all deterministic context-free grammars, a superset of precedence grammars. The LR parser is fully implemented by the Canonical LR parser. The Look-Ahead LR and Simple LR parsers implement simplified variants of it that have significantly reduced memory requirements. [9] [10] Recent research has identified methods by which canonical LR parsers may be implemented with dramatically reduced table requirements over Knuth's table-building algorithm. [11]

Whether LR, LALR or SLR, the basic state machine is the same; only the tables are different, and these tables are almost always mechanically generated. Additionally, these tables are usually implemented such that a REDUCE results in a CALL to a closed subroutine which is external to the state machine and which performs a function which is implied by the semantics of the grammar rule which is being REDUCEd. Therefore, the parser is partitioned into an invariant state machine part, and a variant semantics part. This fundamental distinction encourages the development of high-quality parsers which are exceptionally reliable.

Given a specific stack state and lookahead symbol, there are precisely four possible actions, ERROR, SHIFT, REDUCE, and STOP (hereinafter referred to as configurations). The presence of a dot, •, in a configuration represents the current lookahead position, with the lookahead symbol shown to the right of the dot (and which always corresponds to a terminal symbol), and the current stack state to the left of the dot (and which usually corresponds to a nonterminal symbol).

For practical reasons, including higher performance, the tables are usually extended by a somewhat large, auxiliary array of two-bit symbols, obviously compressed into four two-bit symbols, a byte, for efficient accessing on byte-oriented machines, often encoded as:

00b represents ERROR
01b represents SHIFT
10b represents REDUCE
11b represents STOP

(STOP being a special case of SHIFT). The entire array generally includes mostly ERROR configurations, a grammar-defined number of SHIFT and REDUCE configurations, and one STOP configuration.

In programming systems which support the specification of values in quaternary numeral system (base 4, two bits per quaternary digit), such as XPL, these are coded as, for example:

"(2)…0…" represents ERROR
"(2)…1…" represents SHIFT
"(2)…2…" represents REDUCE
"(2)…3…" represents STOP

The SHIFT and the REDUCE tables are implemented separately from the array. The auxiliary array is "probed" only for the current state and lookahead symbol. The (auxiliary) array is "full", whereas the (shift and reduce) tables may be very "sparse" indeed, and significant efficiencies may be achieved through optimal "decomposition" of those SHIFT and REDUCE tables (ERROR and STOP need no tables).

The SHIFT and REDUCE configurations are obvious, from the basic definition of a SHIFT-REDUCE parser.

STOP, then, represents a configuration where the state at the top of the stack and the lookahead terminal symbol is within the subject grammar, and represents the ending of the program:

<program>

it being impossible to SHIFT beyond the final so as to reach, conceptually

<program>

ERROR, then, represents a configuration where the state at the top of the stack, and the lookahead terminal symbol is not within the subject grammar. This presents an opportunity to invoke an error recovery procedure, perhaps, in its most simplistic form, to discard the lookahead terminal symbol and to read the next terminal symbol, but many other programmed actions are possible, including pruning the stack, or discarding the lookahead terminal symbol and pruning the stack (and in a pathological case, it is usually possible to obtain

<program>

where <program> consists only of a "null statement").

In most cases, the stack is purposely pre-loaded, that is, initialized, with

<program>

whereby the initial is assumed to have already been recognized. This, then, represents the beginning of the program, and, thereby, avoids having a separate START configuration, which is, conceptually

<program>

is a special pseudo-terminal symbol mechanically added to the grammar, just as <program> is a special pseudo-nonterminal symbol mechanically added to the grammar (if the programmer did not explicitly include <program> in the grammar, then <program> would be automatically added to the grammar on the programmer's behalf).

Clearly, such a parser has precisely one (implicit) START configuration and one (explicit) STOP configuration, but it can, and usually does have hundreds of SHIFT and REDUCE configurations, and perhaps thousands of ERROR configurations.

Related Research Articles

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though it may suffer problems with certain nullable grammars. The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation in 1968.

In computer science, an LALR parser is part of the compiling process where human readable text is converted into a structured representation to be read by computers. An LALR parser is a software tool to process (parse) text into a very specific internal representation that other programs, such as compilers, can work with. This process happens according to a set of production rules specified by a formal grammar for a computer language.

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR(1) parsers, minimal LR(1) parsers, and generalized LR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In computer science, a Simple LR or SLR parser is a type of LR parser with small parse tables and a relatively simple parser generator algorithm. As with other types of LR(1) parser, an SLR parser is quite efficient at finding the single correct bottom-up parse in a single left-to-right scan over the input stream, without guesswork or backtracking. The parser is mechanically generated from a formal grammar for the language.

A canonical LR parser is a type of bottom-up parsing algorithm used in computer science to analyze and process programming languages. It is based on the LR parsing technique, which stands for "left-to-right, rightmost derivation in reverse."

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-down parsing strategy.

In programming language theory, the associativity of an operator is a property that determines how operators of the same precedence are grouped in the absence of parentheses. If an operand is both preceded and followed by operators, and those operators have equal precedence, then the operand may be used as input to two different operations. The choice of which operations to apply the operand to, is determined by the associativity of the operators. Operators may be associative, left-associative, right-associative or non-associative. The associativity and precedence of an operator is a part of the definition of the programming language; different programming languages may have different associativity and precedence for the same type of operator.

An attribute grammar is a formal way to supplement a formal grammar with semantic information processing. Semantic information is stored in attributes associated with terminal and nonterminal symbols of the grammar. The values of attributes are the result of attribute evaluation rules associated with productions of the grammar. Attributes allow the transfer of information from anywhere in the abstract syntax tree to anywhere else, in a controlled and formal way.

In computer science, parsing reveals the grammatical structure of linear input text, as a first step in working out its meaning. Bottom-up parsing recognizes the text's lowest-level small details first, before its mid-level structures, and leaving the highest-level overall structure to last.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.

In computer science, an operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar. For example, most calculators use operator precedence parsers to convert from the human-readable infix notation relying on order of operations to a format that is optimized for evaluation such as Reverse Polish notation (RPN).

In computer science, a simple precedence parser is a type of bottom-up parser for context-free grammars that can be used only by simple precedence grammars.

An operator precedence grammar is a kind of grammar for formal languages.

A GLR parser is an extension of an LR parser algorithm to handle non-deterministic and ambiguous grammars. The theoretical foundation was provided in a 1974 paper by Bernard Lang. It describes a systematic way to produce such algorithms, and provides uniform results regarding correctness proofs, complexity with respect to grammar classes, and optimization techniques. The first actual implementation of GLR was described in a 1984 paper by Masaru Tomita, it has also been referred to as a "parallel parser". Tomita presented five stages in his original work, though in practice it is the second stage that is recognized as the GLR parser.

SLR grammars are the class of formal grammars accepted by a Simple LR parser. SLR grammars are a superset of all LR(0) grammars and a subset of all LALR(1) and LR(1) grammars.

References

  1. Compilers: Principles, Techniques, and Tools (2nd Edition), by Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman, Prentice Hall 2006.
  2. "Archived copy" (PDF). dragonbook.stanford.edu. Archived from the original (PDF) on 5 March 2016. Retrieved 17 January 2022.{{cite web}}: CS1 maint: archived copy as title (link)
  3. Flex & Bison: Text Processing Tools, by John Levine, O'Reilly Media 2009.
  4. Crafting a Compiler, by Fischer, Ron, and Richard, Addison Wesley 2009.
  5. PL360 - A Programming Language for the 360 Computers, by Niklaus Wirth, J. ACM 15:1 1968.
  6. The Theory of Parsing, Translation, and Compiling, Volume 1: Parsing, by Alfred Aho and Jeffrey Ullman, Prentice Hall 1972.
  7. A Compiler Generator, by William M. McKeeman, J Horning, and D Wortman, Prentice Hall 1970; ISBN   978-0131550773.
  8. Knuth, D. E. (July 1965). "On the translation of languages from left to right" (PDF). Information and Control. 8 (6): 607–639. doi: 10.1016/S0019-9958(65)90426-2 . Retrieved 29 May 2011.
  9. Practical Translators for LR(k) Languages, by Frank DeRemer, MIT PhD dissertation 1969.
  10. Simple LR(k) Grammars, by Frank DeRemer, Comm. ACM 14:7 1971.
  11. X. Chen, Measuring and extending LR(1) parsing, University of Hawaii PhD dissertation, 2009.