Canonical LR parser

Last updated September 07, 2024

A canonical LR parser (also called a LR(1) parser) is a type of bottom-up parsing algorithm used in computer science to analyze and process programming languages. It is based on the LR parsing technique, which stands for "left-to-right, rightmost derivation in reverse."

Formally, a canonical LR parser is an LR(k) parser for k=1, i.e. with a single lookahead terminal. The special attribute of this parser is that any LR(k) grammar with k>1 can be transformed into an LR(1) grammar.^[1] However, back-substitutions are required to reduce k and as back-substitutions increase, the grammar can quickly become large, repetitive and hard to understand. LR(k) can handle all deterministic context-free languages.^[1] In the past this LR(k) parser has been avoided because of its huge memory requirements in favor of less powerful alternatives such as the LALR and the LL(1) parser. Recently, however, a "minimal LR(1) parser" whose space requirements are close to LALR parsers^{[ citation needed ]}, is being offered by several parser generators.

Like most parsers, the LR(1) parser is automatically generated by compiler-compilers like GNU Bison, MSTA, Menhir,^[2] HYACC,^[3] and LRSTAR.^[4]

History

In 1965 Donald Knuth invented the LR(k) parser (Left to right, Rightmost derivation parser) a type of shift-reduce parser, as a generalization of existing precedence parsers. This parser has the potential of recognizing all deterministic context-free languages and can produce both left and right derivations of statements encountered in the input file. Knuth proved that it reaches its maximum language recognition power for k=1 and provided a method for transforming LR(k), k > 1 grammars into LR(1) grammars.^[1]

Canonical LR(1) parsers have the practical disadvantage of having enormous memory requirements for their internal parser-table representation. In 1969, Frank DeRemer suggested two simplified versions of the LR parser called LALR and SLR. These parsers require much less memory than Canonical LR(1) parsers, but have slightly less language-recognition power.^[5] LALR(1) parsers have been the most common implementations of the LR Parser.

However, a new type of LR(1) parser, some people call a "Minimal LR(1) parser" was introduced in 1977 by David Pager^[6] who showed that LR(1) parsers can be created whose memory requirements rival those of LALR(1) parsers. Recently^{[ when? ]}, some parser generators are offering Minimal LR(1) parsers, which not only solve the memory requirement problem, but also the mysterious-conflict-problem inherent in LALR(1) parser generators.^{[ citation needed ]} In addition, Minimal LR(1) parsers can use shift-reduce actions, which makes them faster than Canonical LR(1) parsers.

Overview

The LR(1) parser is a deterministic automaton and as such its operation is based on static state transition tables. These codify the grammar of the language it recognizes and are typically called "parsing tables".

The parsing tables of the LR(1) parser are parameterized with a lookahead terminal. Simple parsing tables, like those used by the LR(0) parser represent grammar rules of the form

A1 → A B

which means that if we go have input A followed by B then we will reduce the pair to A1 regardless of what follows. After parameterizing such a rule with a lookahead we have:

A1 → A B, a

which means that the reduction will now be performed only if the lookahead terminal is a. This allows for richer languages where a simple rule can have different meanings depending on the lookahead context. For example, in a LR(1) grammar, all of the following rules perform a different reduction in spite of being based on the same state sequence.

A1 → A B, a

A2 → A B, b

A3 → A B, c

A4 → A B, d

The same would not be true if a lookahead terminal was not being taken into account. Parsing errors can be identified without the parser having to read the whole input by declaring some rules as errors. For example,

E1 → B C, d

can be declared an error, causing the parser to stop. This means that the lookahead information can also be used to catch errors, as in the following example:

A1 → A B, a

A1 → A B, b

A1 → A B, c

E1 → A B, d

In this case A B will be reduced to A1 when the lookahead is a, b or c and an error will be reported when the lookahead is d.

The lookahead can also be helpful in deciding when to reduce a rule. The lookahead can help avoid reducing a specific rule if the lookahead is not valid, which would probably mean that the current state should be combined with the following instead of the previous state. That means in the following example

Input sequence: A B C
Rules:

A1 → A B

A2 → B C

the sequence can be reduced to

A A2

instead of

A1 C

if the lookahead after the parser went to state B wasn't acceptable, i.e. no transition rule existed. Reductions can be produced directly from a terminal as in

X → y

which allows for multiple sequences to appear.

LR(1) parsers have the requirement that each rule should be expressed in a complete LR(1) manner, i.e. a sequence of two states with a specific lookahead. That makes simple rules such as

X → y

requiring a great many artificial rules that essentially enumerate the combinations of all the possible states and lookahead terminals that can follow. A similar problem appears for implementing non-lookahead rules such as

A1 → A B

where all the possible lookaheads must be enumerated. That is the reason why LR(1) parsers cannot be practically implemented without significant memory optimizations.^[6]

Constructing LR(1) parsing tables

LR(1) parsing tables are constructed in the same way as LR(0) parsing tables with the modification that each Item contains a lookahead terminal. This means, contrary to LR(0) parsers, a different action may be executed, if the item to process is followed by a different terminal.

Parser items

Starting from the production rules of a language, at first the item sets for this language have to be determined. In plain words, an item set is the list of production rules, which the currently processed symbol might be part of. An item set has a one-to-one correspondence to a parser state, while the items within the set, together with the next symbol, are used to decide which state transitions and parser action are to be applied. Each item contains a marker, to note at which point the currently processed symbol appears in the rule the item represents. For LR(1) parsers, each item is specific to a lookahead terminal, thus the lookahead terminal has also been noted inside each item.

For example, assume a language consisting of the terminal symbols 'n', '+', '(', ')', the nonterminals 'E', 'T', the starting rule 'S' and the following production rules:

S → E

E → T

E → ( E )

T → n

T → + T

T → T + n

Items sets will be generated by analog to the procedure for LR(0) parsers. The item set 0 which represents the initial state will be created from the starting rule:

[S → • E, $]

The dot '•' denotes the marker of the current parsing position within this rule. The expected lookahead terminal to apply this rule is noted after the comma. The '$' sign is used to denote 'end of input' is expected, as is the case for the starting rule.

This is not the complete item set 0, though. Each item set must be 'closed', which means all production rules for each nonterminal following a '•' have to be recursively included into the item set until all of those nonterminals are dealt with. The resulting item set is called the closure of the item set we began with.

For LR(1) for each production rule an item has to be included for each possible lookahead terminal following the rule. For more complex languages this usually results in very large item sets, which is the reason for the large memory requirements of LR(1) parsers.

In our example, the starting symbol requires the nonterminal 'E' which in turn requires 'T', thus all production rules will appear in item set 0. At first, we ignore the problem of finding the lookaheads and just look at the case of an LR(0), whose items do not contain lookahead terminals. So the item set 0 (without lookaheads) will look like this:

[S → • E]

[E → • T]

[E → • ( E )]

[T → • n]

[T → • + T]

[T → • T + n]

FIRST and FOLLOW sets

To determine lookahead terminals, so-called FIRST and FOLLOW sets are used. FIRST(A) is the set of terminals which can appear as the first element of any chain of rules matching nonterminal A. FOLLOW(I) of an Item I [A → α • B β, x] is the set of terminals that can appear immediately after nonterminal B, where α, β are arbitrary symbol strings, and x is an arbitrary lookahead terminal. FOLLOW(k,B) of an item set k and a nonterminal B is the union of the follow sets of all items in k where '•' is followed by B. The FIRST sets can be determined directly from the closures of all nonterminals in the language, while the FOLLOW sets are determined from the items under usage of the FIRST sets.

In our example, as one can verify from the full list of item sets below, the first sets are:

FIRST(S) = { n, '+', '(' }

FIRST(E) = { n, '+', '(' }

FIRST(T) = { n, '+' }

Determining lookahead terminals

Within item set 0 the follow sets can be found to be:

FOLLOW(0,S) = { $ }

FOLLOW(0,E) = { $, ')'}

FOLLOW(0,T) = { $, '+', ')' }

From this the full item set 0 for an LR(1) parser can be created, by creating for each item in the LR(0) item set one copy for each terminal in the follow set of the LHS nonterminal. Each element of the follow set may be a valid lookahead terminal:

[S → • E, $]

[E → • T, $]

[E → • T, )]

[E → • ( E ), $]

[E → • ( E ), )]

[T → • n, $]

[T → • n, +]

[T → • n, )]

[T → • + T, $]

[T → • + T, +]

[T → • + T, )]

[T → • T + n, $]

[T → • T + n, +]

[T → • T + n, )]

Creating new item sets

The rest of the item sets can be created by the following algorithm

1. For each terminal and nonterminal symbol A appearing after a '•' in each already existing item set k, create a new item set m by adding to m all the rules of k where '•' is followed by A, but only if m will not be the same as an already existing item set after step 3.

2. shift all the '•'s for each rule in the new item set one symbol to the right

3. create the closure of the new item set

4. Repeat from step 1 for all newly created item sets, until no more new sets appear

In the example we get 5 more sets from item set 0, item set 1 for nonterminal E, item set 2 for nonterminal T, item set 3 for terminal n, item set 4 for terminal '+' and item set 5 for '('.

Item set 1 (E):

[S → E •, $]

Item set 2 (T):

[E → T •, $]

[T → T • + n, $]

[T → T • + n, +]

·

Item set 3 (n):

[T → n •, $]

[T → n •, +]

[T → n •, )]

Item set 4 ('+'):

[T → + • T, $]

[T → + • T, +]

[T → • n, $]

[T → • n, +]

[T → • + T, $]

[T → • + T, +]

[T → • T + n, $]

[T → • T + n, +]

·

Item set 5 ('('):

[E → ( • E ), $]

[E → • T, )]

[E → • ( E ), )]

[T → • n, )]

[T → • n, +]

[T → • + T, )]

[T → • + T, +]

[T → • T + n, )]

[T → • T + n, +]

·

From item sets 2, 4 and 5 several more item sets will be produced. The complete list is quite long and thus will not be stated here. Detailed LR(k) treatment of this grammar can e.g. be found in .

Related Research Articles

<span class="mw-page-title-main">Chomsky hierarchy</span> Hierarchy of classes of formal grammars

The Chomsky hierarchy in the fields of formal language theory, computer science, and linguistics, is a containment hierarchy of classes of formal grammars. A formal grammar describes how to form strings from a language's vocabulary that are valid according to the language's syntax. The linguist Noam Chomsky theorized that four different classes of formal grammars existed that could generate increasingly complex languages. Each class can also completely generate the language of all inferior classes.

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form if all of its production rules are of the form:

In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though it may suffer problems with certain nullable grammars. The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation in 1968.

In computer science, an LALR parser is part of the compiling process where human readable text is converted into a structured representation to be read by computers. An LALR parser is a software tool to process (parse) text into a very specific internal representation that other programs, such as compilers, can work with. This process happens according to a set of production rules specified by a formal grammar for a computer language.

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR(1) parsers, minimal LR(1) parsers, and generalized LR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

In computer science, a Simple LR or SLR parser is a type of LR parser with small parse tables and a relatively simple parser generator algorithm. As with other types of LR(1) parser, an SLR parser is quite efficient at finding the single correct bottom-up parse in a single left-to-right scan over the input stream, without guesswork or backtracking. The parser is mechanically generated from a formal grammar for the language.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

An attribute grammar is a formal way to supplement a formal grammar with semantic information processing. Semantic information is stored in attributes associated with terminal and nonterminal symbols of the grammar. The values of attributes are the result of attribute evaluation rules associated with productions of the grammar. Attributes allow the transfer of information from anywhere in the abstract syntax tree to anywhere else, in a controlled and formal way.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.

In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language and a suffix. For instance, $can be recognized as a sum because it can be broken into, also a sum, and, a suitable suffix.$

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

SLR grammars are the class of formal grammars accepted by a Simple LR parser. SLR grammars are a superset of all LR(0) grammars and a subset of all LALR(1) and LR(1) grammars.

A lookahead LR parser (LALR) generator is a software tool that reads a context-free grammar (CFG) and creates an LALR parser which is capable of parsing files written in the context-free language defined by the CFG. LALR parsers are desirable because they are very fast and small in comparison to other types of parsers.

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

A shift-reduce parser is a class of efficient, table-driven bottom-up parsing methods for computer languages and other notations formally defined by a grammar. The parsing methods most commonly used for parsing programming languages, LR parsing and its variations, are shift-reduce methods. The precedence parsers used before the invention of LR parsing are also shift-reduce methods. All shift-reduce parsers have similar outward effects, in the incremental order in which they build a parse tree or call specific output actions.

References

1 2 3 Knuth, D. E. (July 1965). "On the translation of languages from left to right". Information and Control. 8 (6): 607–639. doi:10.1016/S0019-9958(65)90426-2.
↑ "What is Menhir?". INRIA, CRISTAL project. Retrieved 29 June 2012.
↑ "HYACC, minimal LR(1) parser generator".
↑ "LRSTAR parser generator".
↑ Franklin L. DeRemer (1969). "Practical Translators for LR(k) languages" (PDF). MIT, PhD Dissertation. Archived from the original (PDF) on April 5, 2012.
1 2 Pager, D. (1977), "A Practical General Method for Constructing LR(k) Parsers", Acta Informatica 7, pp. 249–268

External links

Practical LR(k) Parser Construction HTML page, David Tribble
First & Follow Sets Construction in Python (Narayana Chikkam)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Knuth_1965-1] 1 2 3 Knuth, D. E. (July 1965). "On the translation of languages from left to right". Information and Control. 8 (6): 607–639. doi:10.1016/S0019-9958(65)90426-2.

[2] "What is Menhir?". INRIA, CRISTAL project. Retrieved 29 June 2012.

[3] "HYACC, minimal LR(1) parser generator".

[4] "LRSTAR parser generator".

[DeRemer_1969-5] Franklin L. DeRemer (1969). "Practical Translators for LR(k) languages" (PDF). MIT, PhD Dissertation. Archived from the original (PDF) on April 5, 2012.

[Pager_1977-6] 1 2 Pager, D. (1977), "A Practical General Method for Constructing LR(k) Parsers", Acta Informatica 7, pp. 249–268

[1]

[2]

[3]

[4]

[5]

[6]

v t e Parsing algorithms
Top-down	Earley LL Recursive descent Tail recursive
Bottom-up	Precedence Simple Operator Shunting-yard LR Simple Look-ahead Canonical Generalized CYK Recursive ascent Shift-reduce
Mixed, other	Combinator Chart Left corner Statistical
Related topics	PEG Definite clause grammar Deterministic parsing Dynamic programming Memoization Parser generator LALR Parse tree AST Scannerless parsing History of compiler construction Comparison of parser generators Operator-precedence grammar

Canonical LR parser

Contents

History

Overview

Constructing LR(1) parsing tables

Parser items

FIRST and FOLLOW sets

Determining lookahead terminals

Creating new item sets

Goto

Shift actions

Reduce actions

Related Research Articles

References

External links