Top-down parsing language

Last updated February 21, 2024

Top-Down Parsing Language (TDPL) is a type of analytic formal grammar developed by Alexander Birman in the early 1970s^[1]^[2]^[3] in order to study formally the behavior of a common class of practical top-down parsers that support a limited form of backtracking. Birman originally named his formalism the TMG Schema (TS), after TMG, an early parser generator, but it was later given the name TDPL by Aho and Ullman in their classic anthology The Theory of Parsing, Translation and Compiling.^[4]

Definition of a TDPL grammar

Formally, a TDPL grammarG is a quadruple consisting of the following components:

A finite set N of nonterminal symbols.
A finite set Σ of terminal symbols that is disjoint from N.
A finite set P of production rules , where a rule has one of the following forms:
- A → ε, where A is a nonterminal and ε is the empty string.
- A → f, where f is a distinguished symbol representing unconditional failure.
- A → a, where a is any terminal symbol.
- A → BC/D, where B, C, and D are nonterminals.

Interpretation of a grammar

A TDPL grammar can be viewed as an extremely minimalistic formal representation of a recursive descent parser, in which each of the nonterminals schematically represents a parsing function. Each of these nonterminal-functions takes as its input argument a string to be recognized, and yields one of two possible outcomes:

success, in which case the function may optionally move forward or consume one or more characters of the input string supplied to it, or
failure, in which case no input is consumed.

Note that a nonterminal-function may succeed without actually consuming any input, and this is considered an outcome distinct from failure.

A nonterminal A defined by a rule of the form A → ε always succeeds without consuming any input, regardless of the input string provided. Conversely, a rule of the form A → f always fails regardless of input. A rule of the form A → a succeeds if the next character in the input string is the terminal a, in which case the nonterminal succeeds and consumes that one terminal; if the next input character does not match (or there is no next character), then the nonterminal fails.

A nonterminal A defined by a rule of the form A → BC/D first recursively invokes nonterminal B, and if B succeeds, invokes C on the remainder of the input string left unconsumed by B. If both B and C succeed, then A in turn succeeds and consumes the same total number of input characters that B and C together did. If either B or C fails, however, then A backtracks to the original point in the input string where it was first invoked, and then invokes D on that original input string, returning whatever result D produces.

Examples

The following TDPL grammar describes the regular language consisting of an arbitrary-length sequence of a's and b's:

S → AS/T

T → BS/E

A → a

B → b

E → ε

The following grammar describes the context-free Dyck language consisting of arbitrary-length strings of matched braces, such as '{}', '{{}{{}}}', etc.:

S → OT/E

T → SU/F

U → CS/F

O → {

C → }

E → ε

F → f

The above examples can be represented equivalently but much more succinctly in parsing expression grammar notation as S←(a/b)* and S←({S})*, respectively.

Generalized TDPL

A slight variation of TDPL, known as Generalized TDPL or GTDPL, greatly increases the apparent expressiveness of TDPL while retaining the same minimalist approach (though they are actually equivalent). In GTDPL, instead of TDPL's recursive rule form A → BC/D, the rule form A → B[C,D] is used. This rule is interpreted as follows: When nonterminal A is invoked on some input string, it first recursively invokes B. If B succeeds, then A subsequently invokes C on the remainder of the input left unconsumed by B, and returns the result of C to the original caller. If B fails, on the other hand, then A invokes D on the original input string, and passes the result back to the caller.

The important difference between this rule form and the A → BC/D rule form used in TDPL is that C and D are never both invoked in the same call to A: that is, the GTDPL rule acts more like a "pure" if/then/else construct using B as the condition.

In GTDPL it is straightforward to express interesting non-context-free languages such as the classic example {aⁿbⁿcⁿ}.

A GTDPL grammar can be reduced to an equivalent TDPL grammar that recognizes the same language, although the process is not straightforward and may greatly increase the number of rules required.^[5] Also, both TDPL and GTDPL can be viewed as very restricted forms of parsing expression grammars, all of which represent the same class of grammars.^[5]

Related Research Articles

<span class="mw-page-title-main">Chomsky hierarchy</span> Hierarchy of classes of formal grammars

The Chomsky hierarchy in the fields of formal language theory, computer science, and linguistics, is a containment hierarchy of classes of formal grammars. A formal grammar describes how to form strings from a language's vocabulary that are valid according to the language's syntax. Linguist Noam Chomsky theorized that four different classes of formal grammars existed that could generate increasingly complex languages. Each class can also completely generate the language of all inferior classes.

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form if all of its production rules are of the form:

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR(1) parsers, minimal LR(1) parsers, and generalized LR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

In computer science, extended Backus–Naur form (EBNF) is a family of metasyntax notations, any of which can be used to express a context-free grammar. EBNF is used to make a formal description of a formal language such as a computer programming language. They are extensions of the basic Backus–Naur form (BNF) metasyntax notation.

Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-down parsing strategy.

In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree. Every non-empty context-free language admits an ambiguous grammar by introducing e.g. a duplicate rule. A language that only admits ambiguous grammars is called an inherently ambiguous language. Deterministic context-free grammars are always unambiguous, and are an important subclass of unambiguous grammars; there are non-deterministic unambiguous grammars, however.

In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls to pure functions and returning the cached result when the same inputs occur again. Memoization has also been used in other contexts, such as in simple mutually recursive descent parsing. It is a type of caching, distinct from other forms of caching such as buffering and page replacement. In the context of some logic programming languages, memoization is also known as tabling.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.

In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language and a suffix. For instance, $can be recognized as a sum because it can be broken into, also a sum, and, a suitable suffix.$

An operator precedence grammar is a kind of grammar for formal languages.

In formal languages, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined as part of a formal grammar. Nonterminal symbols are replaced by groups of terminal symbols according to the production rules.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

A shift-reduce parser is a class of efficient, table-driven bottom-up parsing methods for computer languages and other notations formally defined by a grammar. The parsing methods most commonly used for parsing programming languages, LR parsing and its variations, are shift-reduce methods. The precedence parsers used before the invention of LR parsing are also shift-reduce methods. All shift-reduce parsers have similar outward effects, in the incremental order in which they build a parse tree or call specific output actions.

References

↑ Birman, Alexander (1970). The TMG Recognition Schema. ACM Digital Library (phd). Princeton University.
↑ Birman, Alexander; Ullman, Jeffrey D. (October 1970). "Parsing algorithms with backtrack". SWAT '70: Proceedings of the 11th Annual Symposium on Switching and Automata Theory: 153–174. doi:10.1109/SWAT.1970.18.
↑ Birman, Alexander; Ullman, Jeffrey D. (1973). "Parsing algorithms with backtrack" (PDF). Information and Control. 23 (1): 1–34. doi:10.1016/S0019-9958(73)90851-6.
↑ Aho, Alfred V.; Ullman, Jeffrey D. (1972). The Theory of Parsing, Translation and Compiling: Volume 1: Parsing. Upper Saddle River, NJ: Prentice-Hall. pp. 456–485. ISBN 978-0-13-914556-8.
1 2 Ford, Bryan. Parsing Expression Grammars: A Recognition-Based Syntactic Foundation

External links

The Packrat Parsing and Parsing Expression Grammars Page

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Birman, Alexander (1970). The TMG Recognition Schema. ACM Digital Library (phd). Princeton University.

[2] Birman, Alexander; Ullman, Jeffrey D. (October 1970). "Parsing algorithms with backtrack". SWAT '70: Proceedings of the 11th Annual Symposium on Switching and Automata Theory: 153–174. doi:10.1109/SWAT.1970.18.

[3] Birman, Alexander; Ullman, Jeffrey D. (1973). "Parsing algorithms with backtrack" (PDF). Information and Control. 23 (1): 1–34. doi:10.1016/S0019-9958(73)90851-6.

[4] Aho, Alfred V.; Ullman, Jeffrey D. (1972). The Theory of Parsing, Translation and Compiling: Volume 1: Parsing. Upper Saddle River, NJ: Prentice-Hall. pp. 456–485. ISBN 978-0-13-914556-8.

[Ford-5] 1 2 Ford, Bryan. Parsing Expression Grammars: A Recognition-Based Syntactic Foundation

[1]

[2]

[3]

[4]

[5]