Operator-precedence grammar

Last updated

An operator precedence grammar is a kind of grammar for formal languages.

Contents

Technically, an operator precedence grammar is a context-free grammar that has the property (among others) [1] that no production has either an empty right-hand side or two adjacent nonterminals in its right-hand side. These properties allow precedence relations to be defined between the terminals of the grammar. A parser that exploits these relations is considerably simpler than more general-purpose parsers, such as LALR parsers. Operator-precedence parsers can be constructed for a large class of context-free grammars.

Precedence relations

Operator precedence grammars rely on the following three precedence relations between the terminals: [2]

RelationMeaning
a yields precedence to b
a has the same precedence as b
a takes precedence over b

These operator precedence relations allow to delimit the handles in the right sentential forms: marks the left end, appears in the interior of the handle, and marks the right end. Contrary to other shift-reduce parsers, all nonterminals are considered equal for the purpose of identifying handles. [3] The relations do not have the same properties as their un-dotted counterparts; e. g. does not generally imply , and does not follow from . Furthermore, does not generally hold, and is possible.

Let us assume that between the terminals ai and ai+1 there is always exactly one precedence relation. Suppose that $ is the end of the string. Then for all terminals b we define: and . If we remove all nonterminals and place the correct precedence relation: , , between the remaining terminals, there remain strings that can be analyzed by an easily developed bottom-up parser.

Example

For example, the following operator precedence relations can be introduced for simple expressions: [4]

They follow from the following facts: [5]

The input string [4]

after adding end markers and inserting precedence relations becomes

Operator precedence parsing

Having precedence relations allows to identify handles as follows: [4]

It is generally not necessary to scan the entire sentential form to find the handle.

Operator precedence parsing algorithm

The algorithm below is from Aho et al.: [6]

If $ is on the top of the stack and ip points to $ then return else     Let a be the top terminal on the stack, and b the symbol pointed to by ip     ifaborabthen         push b onto the stack         advance ip to the next input symbol     else ifabthenrepeat             pop the stack         until the top stack terminal is related by  to the terminal most recently popped     else error() end

Precedence functions

An operator precedence parser usually does not store the precedence table with the relations, which can get rather large. Instead, precedence functions f and g are defined. [7] They map terminal symbols to integers, and so the precedence relations between the symbols are implemented by numerical comparison: must hold if holds, etc.

Not every table of precedence relations has precedence functions, but in practice for most grammars such functions can be designed. [8]

Algorithm for constructing precedence functions

The below algorithm is from Aho et al.: [9]

  1. Create symbols fa and ga for each grammar terminal a and for the end of string symbol;
  2. Partition the created symbols in groups so that fa and gb are in the same group if (there can be symbols in the same group even if their terminals are not connected by this relation);
  3. Create a directed graph whose nodes are the groups. For each pair of terminals do: place an edge from the group of gb to the group of fa if , otherwise if place an edge from the group of fa to that of gb;
  4. If the constructed graph has a cycle then no precedence functions exist. When there are no cycles, let be the length of the longest path from the group of fa and let be the length of the longest path from the group of ga.

Example

Consider the following table (repeated from above): [10]

Using the algorithm leads to the following graph:

    gid       \  fid   f*     \  /      g*     /   f+      | \    |  g+    |  |   g$  f$

from which we extract the following precedence functions from the maximum heights in the directed acyclic graph:

id+*$
f4240
g5130

Operator-precedence languages

The class of languages described by operator-precedence grammars, i.e., operator-precedence languages, is strictly contained in the class of deterministic context-free languages, and strictly contains visibly pushdown languages. [11]

Operator-precedence languages enjoy many closure properties: union, intersection, complementation, [12] concatenation, [11] and they are the largest known class closed under all these operations and for which the emptiness problem is decidable. Another peculiar feature of operator-precedence languages is their local parsability, [13] that enables efficient parallel parsing.

There are also characterizations based on an equivalent form of automata and monadic second-order logic. [14]

Notes

Related Research Articles

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form if all of its production rules are of the form:

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR(1) parsers, minimal LR(1) parsers, and generalized LR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

<span class="mw-page-title-main">Parse tree</span> Tree in formal language theory

A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term parse tree itself is used primarily in computational linguistics; in theoretical syntax, the term syntax tree is more common.

Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-down parsing strategy.

Top-Down Parsing Language (TDPL) is a type of analytic formal grammar developed by Alexander Birman in the early 1970s in order to study formally the behavior of a common class of practical top-down parsers that support a limited form of backtracking. Birman originally named his formalism the TMG Schema (TS), after TMG, an early parser generator, but it was later given the name TDPL by Aho and Ullman in their classic anthology The Theory of Parsing, Translation and Compiling.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.

In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language and a suffix. For instance, can be recognized as a sum because it can be broken into , also a sum, and , a suitable suffix.

A simple precedence grammar is a context-free formal grammar that can be parsed with a simple precedence parser. The concept was first created in 1964 by Claude Pair, and was later rediscovered, from ideas due to Robert Floyd, by Niklaus Wirth and Helmut Weber who published a paper, entitled EULER: a generalization of ALGOL, and its formal definition, published in 1966 in the Communications of the ACM.

In computer science, a Wirth–Weber relationship between a pair of symbols is necessary to determine if a formal grammar is a simple precedence grammar. In such a case, the simple precedence parser can be used. The relationship is named after computer scientists Niklaus Wirth and Helmut Weber.

In computer science, a simple precedence parser is a type of bottom-up parser for context-free grammars that can be used only by simple precedence grammars.

ID/LP Grammars are a subset of Phrase Structure Grammars, differentiated from other formal grammars by distinguishing between immediate dominance (ID) and linear precedence (LP) constraints. Whereas traditional phrase structure rules incorporate dominance and precedence into a single rule, ID/LP Grammars maintains separate rule sets which need not be processed simultaneously. ID/LP Grammars are used in Computational Linguistics.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes how to form strings from an alphabet of a formal language that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

A shift-reduce parser is a class of efficient, table-driven bottom-up parsing methods for computer languages and other notations formally defined by a grammar. The parsing methods most commonly used for parsing programming languages, LR parsing and its variations, are shift-reduce methods. The precedence parsers used before the invention of LR parsing are also shift-reduce methods. All shift-reduce parsers have similar outward effects, in the incremental order in which they build a parse tree or call specific output actions.

References

Further reading