Parser combinator

Last updated November 30, 2024

In computer programming, a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output. In this context, a parser is a function accepting strings as input and returning some structure as output, typically a parse tree or a set of indices representing locations in the string where parsing stopped successfully. Parser combinators enable a recursive descent parsing strategy that facilitates modular piecewise construction and testing. This parsing technique is called combinatory parsing.

Parsers built from combinators have been used extensively in prototyping compilers and processors for domain-specific languages, such as natural-language user interfaces to databases, where complex and varied semantic actions are closely integrated with syntactic processing. In 1989, Richard Frost and John Launchbury demonstrated the use of parser combinators to construct natural-language interpreters.^[1] Graham Hutton used higher-order functions for basic parsing in 1992^[2] and, with Erik Meijer, for monadic parsing in 1996.^[3] S. D. Swierstra exhibited the practical aspects of parser combinators in 2001.^[4] In 2008, Frost, Hafiz and Callaghan^[5] described a set of parser combinators in the functional programming language Haskell that solve the long-standing problem of accommodating left recursion, and work as a complete top-down parsing tool in polynomial time and space.

Basic idea

In any programming language that has first-class functions, parser combinators can be used to combine basic parsers into parsers for more complex rules. For example, a production rule of a context-free grammar (CFG) may have one or more alternatives, and each alternative may be a sequence of non-terminals and terminals, a single non-terminal or terminal, or the empty string. If a simple parser is available for each of these alternatives, a parser combinator can combine them into a new parser that recognizes any or all of the alternatives.

In languages that support operator overloading, a parser combinator can take the form of an infix operator, used to glue different parsers together to form a complete rule. Parser combinators thereby enable parsers to be defined in an embedded style, in code whose structure is similar to the rules of the formal grammar. As such, implementations can be thought of as executable specifications, with the associated advantages such as readability.
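As a rough illustration of the infix style, the following Python sketch overloads | for alternatives and >> for sequencing. The class name Recognizer, the helpers token and greeting, the choice of operators, and the convention that a parser maps a start index to a set of end indices are all assumptions made for this sketch, not the API of any particular library.

```python
class Recognizer:
    """Wraps a function from a start index to the set of end indices it can reach."""

    def __init__(self, run):
        self.run = run

    def __call__(self, j):
        return self.run(j)

    def __or__(self, other):
        # Alternative: succeed wherever either operand succeeds.
        return Recognizer(lambda j: self(j) | other(j))

    def __rshift__(self, other):
        # Sequence: run `other` from every index where `self` finished.
        return Recognizer(lambda j: {e for k in self(j) for e in other(k)})


def token(x, text):
    """Hypothetical helper: recognize the single character x in text."""
    return Recognizer(lambda j: {j + 1} if j < len(text) and text[j] == x else set())


def greeting(text):
    # Reads like the grammar rule: greeting ::= 'h' 'i' | 'y' 'o'
    # Python's >> binds more tightly than |, so no parentheses are needed.
    return token("h", text) >> token("i", text) | token("y", text) >> token("o", text)


print(greeting("yo")(0))  # {2}: both characters were recognized
```

Because the operator precedence happens to match the grammar convention (sequence before alternative), the rule body can be read almost directly as the production it implements.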

The combinators

To keep the discussion simple, parser combinators are described here in terms of recognizers only. Let the input string have length #input, with its members accessed through an index j counted from 0. A recognizer is a parser that, applied at index j, returns the set of indices at which it successfully finished recognizing a sequence of tokens beginning at j. An empty result set indicates that the recognizer failed to recognize any sequence beginning at index j.

The empty recognizer recognizes the empty string. This parser always succeeds, returning a singleton set containing the input index:

empty(j)=\{j\}

A recognizer term x recognizes the terminal x. If the token at index j in the input string is x, this parser returns a singleton set containing j + 1; otherwise, it returns the empty set.

term(x,j)=\begin{cases}\{\},&j\geq \#input\\\{j+1\},&j^{\text{th}}\text{ element of }input=x\\\{\},&\text{otherwise}\end{cases}
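The two basic recognizers can be sketched in Python as follows. This is a minimal sketch, assuming 0-based indices and an input held in an ordinary sequence; passing the input to term as a second argument (tokens) is a choice made here for illustration, whereas the formulas above treat the input as implicit.

```python
from typing import Callable, Sequence, Set

# A recognizer maps a start index j to the set of indices where it can finish.
Recognizer = Callable[[int], Set[int]]


def empty(j: int) -> Set[int]:
    """Recognize the empty string: always succeeds without consuming input."""
    return {j}


def term(x: str, tokens: Sequence[str]) -> Recognizer:
    """Build a recognizer for the single terminal x over a fixed token sequence."""
    def recognize(j: int) -> Set[int]:
        # Succeeds only if j is inside the input and the token there equals x.
        if j < len(tokens) and tokens[j] == x:
            return {j + 1}
        return set()
    return recognize


tokens = "xxy"
print(empty(3))              # {3}
print(term("x", tokens)(0))  # {1}
print(term("x", tokens)(2))  # set(): the token at index 2 is 'y'
```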

Given two recognizers p and q, we can define two major parser combinators, one for matching alternative rules and one for sequencing rules:

The ‘alternative’ parser combinator, ⊕, applies both recognizers at the same input index j and returns the union of their finishing indices:

(p\oplus q)(j)=p(j)\cup q(j)

The ‘sequence’ combinator, ⊛, applies the first recognizer p to the input index j, and for each finishing index applies the second recognizer q with that as a starting index. It returns the union of the finishing indices returned from all invocations of q:

(p\circledast q)(j)=\bigcup \{q(k):k\in p(j)\}
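Both combinators translate almost literally into Python. The sketch below assumes recognizers are plain functions from a start index to a set of end indices; the names alt and seq stand in for ⊕ and ⊛, and term is a small helper defined here only so that the example can run on its own.

```python
def term(x, tokens):
    """Helper for the demo: recognize the single token x."""
    return lambda j: {j + 1} if j < len(tokens) and tokens[j] == x else set()


def alt(p, q):
    """(p ⊕ q)(j) = p(j) ∪ q(j)"""
    return lambda j: p(j) | q(j)


def seq(p, q):
    """(p ⊛ q)(j) = union of q(k) for every k in p(j)"""
    def recognize(j):
        ends = set()
        for k in p(j):      # every index where p finished ...
            ends |= q(k)    # ... is a starting index for q
        return ends
    return recognize


tokens = "ab"
a, b = term("a", tokens), term("b", tokens)
print(seq(a, b)(0))  # {2}: 'a' followed by 'b'
print(alt(a, b)(1))  # {2}: 'b' matches at index 1
print(seq(a, a)(0))  # set(): no second 'a'
```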

There may be multiple distinct ways to parse a string while finishing at the same index, indicating an ambiguous grammar. Simple recognizers do not acknowledge these ambiguities; each possible finishing index is listed only once in the result set. For a more complete set of results, a more complicated object such as a parse tree must be returned.
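The difference between recognizing and parsing can be made concrete with a small Python sketch in which each result is a pair of finishing index and parse tree. Representing trees as nested tuples, and the names used here, are assumptions of this illustration. The grammar is s ::= 'x' s s | ε, applied to the two-token input x x.

```python
def empty(j):
    return {(j, "ε")}


def term(x, tokens):
    return lambda j: {(j + 1, x)} if j < len(tokens) and tokens[j] == x else set()


def alt(p, q):
    return lambda j: p(j) | q(j)


def seq(p, q):
    # Pair the tree built by p with every tree q builds from p's end index.
    return lambda j: {(e2, (t1, t2)) for (e1, t1) in p(j) for (e2, t2) in q(e1)}


def make_s(tokens):
    x = term("x", tokens)

    def s(j):
        # s ::= 'x' s s | ε
        return alt(seq(x, seq(s, s)), empty)(j)

    return s


results = make_s("xx")(0)
ends = {end for end, _ in results}
trees_to_end = [tree for end, tree in results if end == 2]
print(sorted(ends))       # [0, 1, 2]: what a recognizer would report
print(len(trees_to_end))  # 2: two distinct trees finish at index 2
```

A recognizer would list index 2 once, while the tree-returning version exposes that there are two derivations reaching it.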

Examples

Consider a highly ambiguous context-free grammar, s ::= 'x' s s | ε. Using the combinators defined earlier, with <*> written for the sequence combinator ⊛ and <+> for the alternative combinator ⊕, the grammar can be transcribed into executable notation in a modern functional programming language (e.g., Haskell) as s = term 'x' <*> s <*> s <+> empty, assuming that <*> binds more tightly than <+>. With indices counted from 0, applying the recognizer s at index 2 of the five-token input sequence x x x x x returns the result set {2,3,4,5}, indicating that there were matches starting at index 2 and finishing at any index between 2 and 5 inclusive.
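The result set can be checked with a short Python sketch of the same recognizer. It assumes 0-based indices and uses the function names alt and seq in place of the infix operators. Because Python evaluates eagerly, the recursive rule is wrapped in a function so that s is looked up only when it is applied.

```python
def empty(j):
    return {j}


def term(x, tokens):
    return lambda j: {j + 1} if j < len(tokens) and tokens[j] == x else set()


def alt(p, q):
    return lambda j: p(j) | q(j)


def seq(p, q):
    return lambda j: {e for k in p(j) for e in q(k)}


def make_s(tokens):
    x = term("x", tokens)

    def s(j):
        # s ::= 'x' s s | ε
        return alt(seq(x, seq(s, s)), empty)(j)

    return s


s = make_s(["x"] * 5)
print(sorted(s(2)))  # [2, 3, 4, 5]
```

The recursion terminates because every recursive use of s happens only after term has consumed one token.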

Shortcomings and solutions

Parser combinators, like all recursive descent parsers, are not limited to context-free grammars and therefore perform no global search for ambiguities in the LL(k) parsing First_k and Follow_k sets. As a result, ambiguities remain unknown until run time, and surface only if the input triggers them. In such cases, the recursive descent parser may default (perhaps without the grammar designer's knowledge) to one of the possible ambiguous paths, resulting in semantic confusion (aliasing) in the use of the language. Users of an ambiguous programming language can then introduce bugs that are not reported at compile time and that stem not from human error but from the ambiguous grammar. The only solution that eliminates these bugs is to remove the ambiguities and use a context-free grammar.
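One way such a silent default can arise is a committed (first-match) choice. The following Python sketch is purely illustrative: the combinator first_of and the toy rule 'a' | 'a' 'b' are assumptions introduced here and are not taken from the works cited in this article. It contrasts a choice that keeps every alternative with one that commits to the first alternative that succeeds.

```python
def term(x, tokens):
    return lambda j: {j + 1} if j < len(tokens) and tokens[j] == x else set()


def seq(p, q):
    return lambda j: {e for k in p(j) for e in q(k)}


def alt(p, q):
    """Exhaustive choice: keeps the results of both alternatives."""
    return lambda j: p(j) | q(j)


def first_of(p, q):
    """Committed choice (hypothetical): q is tried only if p fails outright."""
    def recognize(j):
        ends = p(j)
        return ends if ends else q(j)
    return recognize


def rules(tokens):
    a, b = term("a", tokens), term("b", tokens)
    # rule ::= 'a' | 'a' 'b'
    return alt(a, seq(a, b)), first_of(a, seq(a, b))


exhaustive, committed = rules("ab")
print(sorted(exhaustive(0)))  # [1, 2]: both readings are reported
print(sorted(committed(0)))   # [1]: the longer reading is silently dropped
```

Nothing in the committed version signals that a second reading existed; the overlap only shows up when an input such as a b happens to exercise it.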

Simple implementations of parser combinators have some shortcomings that are common in top-down parsing. Naïve combinatory parsing requires exponential time and space when parsing an ambiguous context-free grammar. In 1996, Frost and Szydlowski demonstrated how memoization can be used with parser combinators to reduce the time complexity to polynomial.^[6] Later, Frost used monads to construct the combinators so that the memo table is threaded systematically and correctly throughout the computation.^[7]
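The effect of memoization can be seen in a simplified Python sketch. It is not the scheme of the cited papers: the wrapper memoize below simply caches each recognizer's result per start index in a mutable dictionary, and the call counter is added only to make the difference visible. All names are assumptions of this sketch.

```python
def empty(j):
    return {j}


def term(x, tokens):
    return lambda j: {j + 1} if j < len(tokens) and tokens[j] == x else set()


def seq(p, q):
    return lambda j: {e for k in p(j) for e in q(k)}


def memoize(p):
    """Cache the result set of recognizer p for each start index."""
    table = {}

    def recognize(j):
        if j not in table:
            table[j] = p(j)
        return table[j]

    return recognize


def make_s(tokens, memo):
    calls = [0]  # counts evaluations of the rule body
    x = term("x", tokens)

    def body(j):
        # s ::= 'x' s s | ε
        calls[0] += 1
        return seq(x, seq(s, s))(j) | empty(j)

    s = memoize(body) if memo else body
    return s, calls


tokens = ["x"] * 10
fast, fast_calls = make_s(tokens, memo=True)
slow, slow_calls = make_s(tokens, memo=False)
assert fast(0) == slow(0)
print(fast_calls[0])  # 11: the rule body runs once per start index
print(slow_calls[0])  # far larger, and it grows exponentially with input length
```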

Like any top-down recursive descent parser, conventional parser combinators (such as those described above) will not terminate while processing a left-recursive grammar (e.g. s ::= s <*> term 'x' | empty). A recognition algorithm that accommodates ambiguous grammars with direct left-recursive rules was described by Frost and Hafiz in 2006.^[8] The algorithm curtails the otherwise ever-growing left-recursive parse by imposing depth restrictions. In 2007, Frost, Hafiz and Callaghan extended it to a complete parsing algorithm that accommodates indirect as well as direct left recursion in polynomial time and generates compact polynomial-size representations of the potentially exponential number of parse trees for highly ambiguous grammars.^[9] The extended algorithm accommodates indirect left recursion by comparing its ‘computed context’ with the ‘current context’. The same authors also described their implementation of a set of parser combinators written in the Haskell language based on the same algorithm.^[5]^[10]
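The idea of a depth restriction can be illustrated with a deliberately simplified Python sketch. It assumes one particular bound: a recognizer that is already active more than (number of remaining tokens + 1) times at the same index is cut off, since deeper nesting could not consume any further input. The wrapper name curtail and this exact bound are assumptions of the sketch; it handles only direct left recursion and omits the memoization and context comparison of the published algorithms.

```python
def empty(j):
    return {j}


def term(x, tokens):
    return lambda j: {j + 1} if j < len(tokens) and tokens[j] == x else set()


def seq(p, q):
    return lambda j: {e for k in p(j) for e in q(k)}


def curtail(p, n_tokens):
    """Wrap p so that nested applications at one index are depth-limited."""
    depth = {}  # start index -> number of currently active applications of p

    def recognize(j):
        if depth.get(j, 0) > n_tokens - j + 1:
            return set()  # curtail: treat the over-deep application as failure
        depth[j] = depth.get(j, 0) + 1
        try:
            return p(j)
        finally:
            depth[j] -= 1

    return recognize


def make_left_recursive(tokens):
    x = term("x", tokens)

    def body(j):
        # s ::= s 'x' | ε   (directly left-recursive)
        return seq(s, x)(j) | empty(j)

    s = curtail(body, len(tokens))
    return s


s = make_left_recursive("xxx")
print(sorted(s(0)))  # [0, 1, 2, 3]
```

Without the wrapper, applying body at any index would recurse on itself at the same index forever; with it, the recursion bottoms out after a bounded number of nested applications and the results are built up on the way back.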

Notes

[1] Frost & Launchbury 1989.
[2] Hutton 1992.
[3] Hutton, Graham; Meijer, Erik. Monadic Parser Combinators (PDF) (Report). University of Nottingham. Retrieved 13 February 2023.
[4] Swierstra 2001.
[5] Frost, Hafiz & Callaghan 2008.
[6] Frost & Szydlowski 1996.
[7] Frost 2003.
[8] Frost & Hafiz 2006.
[9] Frost, Hafiz & Callaghan 2007.
[10] cf. X-SAIGA - executable specifications of grammars

References

Burge, William H. (1975). Recursive Programming Techniques. The Systems programming series. Addison-Wesley. ISBN 978-0201144505.
Frost, Richard; Launchbury, John (1989). "Constructing natural language interpreters in a lazy functional language" (PDF). The Computer Journal. Special edition on Lazy Functional Programming. 32 (2): 108–121. doi:10.1093/comjnl/32.2.108. Archived from the original on 2013-06-06.
Frost, Richard A.; Szydlowski, Barbara (1996). "Memoizing Purely Functional Top-Down Backtracking Language Processors" (PDF). Sci. Comput. Program. 27 (3): 263–288. doi:10.1016/0167-6423(96)00014-7.
Frost, Richard A. (2003). "Monadic Memoization towards Correctness-Preserving Reduction of Search". Proceedings of the 16th Canadian Society for Computational Studies of Intelligence Conference on Advances in Artificial Intelligence (AI'03) (PDF). Springer. pp. 66–80. ISBN 978-3-540-40300-5.
Frost, Richard A.; Hafiz, Rahmatullah (2006). "A New Top-Down Parsing Algorithm to Accommodate Ambiguity and Left Recursion in Polynomial Time" (PDF). ACM SIGPLAN Notices. 41 (5): 46–54. doi:10.1145/1149982.1149988. S2CID 8006549.
Frost, Richard A.; Hafiz, Rahmatullah; Callaghan, Paul (2007). "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars". Proceedings of the 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE: 109–120. CiteSeerX 10.1.1.97.8915.
Frost, Richard A.; Hafiz, Rahmatullah; Callaghan, Paul (2008). "Parser Combinators for Ambiguous Left-Recursive Grammars". Practical Aspects of Declarative Languages. ACM-SIGPLAN. Vol. 4902. pp. 167–181. CiteSeerX 10.1.1.89.2132. doi:10.1007/978-3-540-77442-6_12. ISBN 978-3-540-77441-9.
Hutton, Graham (1992). "Higher-order functions for parsing". Journal of Functional Programming. 2 (3): 323–343. CiteSeerX 10.1.1.34.1287. doi:10.1017/s0956796800000411. S2CID 31067887.
Okasaki, Chris (1998). "Even higher-order functions for parsing or Why would anyone ever want to use a sixth-order function?". Journal of Functional Programming. 8 (2): 195–199. doi:10.1017/S0956796898003001. S2CID 59694674.
Swierstra, S. Doaitse (2001). "Combinator parsers: From toys to tools". Electronic Notes in Theoretical Computer Science. 41: 38–59. doi:10.1016/S1571-0661(05)80545-6.
Wadler, Philip (1985). "How to replace failure by a list of successes: a method for exception handling, backtracking, and pattern matching in lazy functional languages". Functional Programming Languages and Computer Architecture. Lecture Notes in Computer Science. Vol. 201. pp. 113–128. doi:10.1007/3-540-15975-4_33. ISBN 978-0-387-15975-1 – via Proceedings of a Conference on Functional Programming Languages and Computer Architecture.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
