Left recursion

Last updated

In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language (on the left) and a suffix (on the right). For instance, can be recognized as a sum because it can be broken into , also a sum, and , a suitable suffix.

Contents

In terms of context-free grammar, a nonterminal is left-recursive if the leftmost symbol in one of its productions is itself (in the case of direct left recursion) or can be made itself by some sequence of substitutions (in the case of indirect left recursion).

Definition

A grammar is left-recursive if and only if there exists a nonterminal symbol that can derive to a sentential form with itself as the leftmost symbol. [1] Symbolically,

,

where indicates the operation of making one or more substitutions, and is any sequence of terminal and nonterminal symbols.

Direct left recursion

Direct left recursion occurs when the definition can be satisfied with only one substitution. It requires a rule of the form

where is a sequence of nonterminals and terminals . For example, the rule

is directly left-recursive. A left-to-right recursive descent parser for this rule might look like

voidExpression(){Expression();match('+');Term();}

and such code would fall into infinite recursion when executed.

Indirect left recursion

Indirect left recursion occurs when the definition of left recursion is satisfied via several substitutions. It entails a set of rules following the pattern

where are sequences that can each yield the empty string, while may be any sequences of terminal and nonterminal symbols at all. Note that these sequences may be empty. The derivation

then gives as leftmost in its final sentential form.

Uses

Left recursion is commonly used as an idiom for making operations left-associative: that an expression a+b-c-d+e is evaluated as (((a+b)-c)-d)+e. In this case, that evaluation order could be achieved as a matter of syntax via the three grammatical rules

These only allow parsing the a+b-c-d+e as consisting of the a+b-c-d and e, where a+b-c-d in turn consists of the a+b-c and d, while a+b-c consists of the a+b and c, etc.

Removing left recursion

Left recursion often poses problems for parsers, either because it leads them into infinite recursion (as in the case of most top-down parsers) or because they expect rules in a normal form that forbids it (as in the case of many bottom-up parsers [ clarification needed ]). Therefore, a grammar is often preprocessed to eliminate the left recursion.

Removing direct left recursion

The general algorithm to remove direct left recursion follows. Several improvements to this method have been made. [2] For a left-recursive nonterminal , discard any rules of the form and consider those that remain:

where:

Replace these with two sets of productions, one set for :

and another set for the fresh nonterminal (often called the "tail" or the "rest"):

Repeat this process until no direct left recursion remains.

As an example, consider the rule set

This could be rewritten to avoid left recursion as

Removing all left recursion

The above process can be extended to eliminate all left recursion, by first converting indirect left recursion to direct left recursion on the highest numbered nonterminal in a cycle.

InputsA grammar: a set of nonterminals and their productions
OutputA modified grammar generating the same language but without left recursion
  1. For each nonterminal :
    1. Repeat until an iteration leaves the grammar unchanged:
      1. For each rule , being a sequence of terminals and nonterminals:
        1. If begins with a nonterminal and :
          1. Let be without its leading .
          2. Remove the rule .
          3. For each rule :
            1. Add the rule .
    2. Remove direct left recursion for as described above.

Step 1.1.1 amounts to expanding the initial nonterminal in the right hand side of some rule , but only if . If was one step in a cycle of productions giving rise to a left recursion, then this has shortened that cycle by one step, but often at the price of increasing the number of rules.

The algorithm may be viewed as establishing a topological ordering on nonterminals: afterwards there can only be a rule if . Note that this algorithm is highly sensitive to the nonterminal ordering; optimizations often focus on choosing this ordering well.

Pitfalls

Although the above transformations preserve the language generated by a grammar, they may change the parse trees that witness strings' recognition. With suitable bookkeeping, tree rewriting can recover the originals, but if this step is omitted, the differences may change the semantics of a parse.

Associativity is particularly vulnerable; left-associative operators typically appear in right-associative-like arrangements under the new grammar. For example, starting with this grammar:

the standard transformations to remove left recursion yield the following:

Parsing the string "1 - 2 - 3" with the first grammar in an LALR parser (which can handle left-recursive grammars) would have resulted in the parse tree:

Left-recursive parsing of a double subtraction Left-recursive-parse-of-a-double-subtraction.png
Left-recursive parsing of a double subtraction

This parse tree groups the terms on the left, giving the correct semantics (1 - 2) - 3.

Parsing with the second grammar gives

Right-recursive parsing of a double subtraction Right-recursive-parse-of-a-double-subtraction.png
Right-recursive parsing of a double subtraction

which, properly interpreted, signifies 1 + (-2 + (-3)), also correct, but less faithful to the input and much harder to implement for some operators. Notice how terms to the right appear deeper in the tree, much as a right-recursive grammar would arrange them for 1 - (2 - 3).

Accommodating left recursion in top-down parsing

A formal grammar that contains left recursion cannot be parsed by a LL(k)-parser or other naive recursive descent parser unless it is converted to a weakly equivalent right-recursive form. In contrast, left recursion is preferred for LALR parsers because it results in lower stack usage than right recursion. However, more sophisticated top-down parsers can implement general context-free grammars by use of curtailment. In 2006, Frost and Hafiz described an algorithm which accommodates ambiguous grammars with direct left-recursive production rules. [3] That algorithm was extended to a complete parsing algorithm to accommodate indirect as well as direct left recursion in polynomial time, and to generate compact polynomial-size representations of the potentially exponential number of parse trees for highly ambiguous grammars by Frost, Hafiz and Callaghan in 2007. [4] The authors then implemented the algorithm as a set of parser combinators written in the Haskell programming language. [5]

See also

Related Research Articles

<span class="mw-page-title-main">Chomsky hierarchy</span> Hierarchy of classes of formal grammars

The Chomsky hierarchy in the fields of formal language theory, computer science, and linguistics, is a containment hierarchy of classes of formal grammars. A formal grammar describes how to form strings from a language's vocabulary that are valid according to the language's syntax. Linguist Noam Chomsky theorized that four different classes of formal grammars existed that could generate increasingly complex languages. Each class can also completely generate the language of all inferior classes.

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though it may suffer problems with certain nullable grammars. The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation in 1968.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.

In computer science, a Wirth–Weber relationship between a pair of symbols is necessary to determine if a formal grammar is a simple precedence grammar. In such a case, the simple precedence parser can be used. The relationship is named after computer scientists Niklaus Wirth and Helmut Weber.

ID/LP Grammars are a subset of Phrase Structure Grammars, differentiated from other formal grammars by distinguishing between immediate dominance (ID) and linear precedence (LP) constraints. Whereas traditional phrase structure rules incorporate dominance and precedence into a single rule, ID/LP Grammars maintains separate rule sets which need not be processed simultaneously. ID/LP Grammars are used in Computational Linguistics.

Conjunctive grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with a conjunction operation. Besides explicit conjunction, conjunctive grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction can be used, in particular, to specify intersection of languages. A further extension of conjunctive grammars known as Boolean grammars additionally allows explicit negation.

In programming languages and type theory, parametric polymorphism allows a single piece of code to be given a "generic" type, using variables in place of actual types, and then instantiated with particular types as needed. Parametrically polymorphic functions and data types are sometimes called generic functions and generic datatypes, respectively, and they form the basis of generic programming.

For parsing algorithms in computer science, the inside–outside algorithm is a way of re-estimating production probabilities in a probabilistic context-free grammar. It was introduced by James K. Baker in 1979 as a generalization of the forward–backward algorithm for parameter estimation on hidden Markov models to stochastic context-free grammars. It is used to compute expectations, for example as part of the expectation–maximization algorithm.

In mathematics, particularly set theory, non-recursive ordinals are large countable ordinals greater than all the recursive ordinals, and therefore can not be expressed using recursive ordinal notations.

Combinatory categorial grammar (CCG) is an efficiently parsable, yet linguistically expressive grammar formalism. It has a transparent interface between surface syntax and underlying semantic representation, including predicate–argument structure, quantification and information structure. The formalism generates constituency-based structures and is therefore a type of phrase structure grammar.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes how to form strings from an alphabet of a formal language that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

Range concatenation grammar (RCG) is a grammar formalism developed by Pierre Boullier in 1998 as an attempt to characterize a number of phenomena of natural language, such as Chinese numbers and German word order scrambling, which are outside the bounds of the mildly context-sensitive languages.

<span class="mw-page-title-main">LL grammar</span> Type of a context-free grammar

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

A Hindley–Milner (HM) type system is a classical type system for the lambda calculus with parametric polymorphism. It is also known as Damas–Milner or Damas–Hindley–Milner. It was first described by J. Roger Hindley and later rediscovered by Robin Milner. Luis Damas contributed a close formal analysis and proof of the method in his PhD thesis.

In mathematics, the hyperoperation sequence is an infinite sequence of arithmetic operations (called hyperoperations in this context) that starts with a unary operation (the successor function with n = 0). The sequence continues with the binary operations of addition (n = 1), multiplication (n = 2), and exponentiation (n = 3).

References

  1. Notes on Formal Language Theory and Parsing at the Wayback Machine (archived 2007-11-27). James Power, Department of Computer Science National University of Ireland, Maynooth Maynooth, Co. Kildare, Ireland.JPR02
  2. Moore, Robert C. (May 2000). "Removing Left Recursion from Context-Free Grammars" (PDF). 6th Applied Natural Language Processing Conference: 249–255.
  3. Frost, R.; R. Hafiz (2006). "A New Top-Down Parsing Algorithm to Accommodate Ambiguity and Left Recursion in Polynomial Time". ACM SIGPLAN Notices. 41 (5): 46–54. doi:10.1145/1149982.1149988. S2CID   8006549., available from the author at http://hafiz.myweb.cs.uwindsor.ca/pub/p46-frost.pdf Archived 2015-01-08 at the Wayback Machine
  4. Frost, R.; R. Hafiz; P. Callaghan (June 2007). "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars" (PDF). 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE: 109–120. Archived from the original (PDF) on 2011-05-27.
  5. Frost, R.; R. Hafiz; P. Callaghan (January 2008). "Parser Combinators for Ambiguous Left-Recursive Grammars". Practical Aspects of Declarative Languages (PDF). Lecture Notes in Computer Science. Vol. 4902. pp. 167–181. doi:10.1007/978-3-540-77442-6_12. ISBN   978-3-540-77441-9.