Earley parser

Class: Parsing grammars that are context-free
Data structure: String
Worst-case performance: O(n³)
Best-case performance: O(n)
Average performance: O(n²) for unambiguous grammars

In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though (depending on the variant) it may suffer problems with certain nullable grammars.[1] The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation[2] in 1968 (and later appeared in an abbreviated, more legible form in a journal[3]).


Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages. The Earley parser executes in cubic time, O(n³), in the general case, where n is the length of the parsed string; in quadratic time, O(n²), for unambiguous grammars;[4] and in linear time for all deterministic context-free grammars. It performs particularly well when the rules are written left-recursively.

Earley recogniser

The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser.

The algorithm

In the following descriptions, α, β, and γ represent any string of terminals/nonterminals (including the empty string), X and Y represent single nonterminals, and a represents a terminal symbol.

Earley's algorithm is a top-down dynamic programming algorithm. In the following, we use Earley's dot notation: given a production X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.

Input position 0 is the position prior to input. Input position n is the position after accepting the nth token. (Informally, input positions can be thought of as locations at token boundaries.) For every input position, the parser generates a state set. Each state is a tuple (X → α • β, i), consisting of

  - the production currently being matched (X → α β)
  - the current position in that production (visually represented by the dot •)
  - the position i in the input at which the matching of this production began: the origin position
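For concreteness, such a state can be modelled as a small record. The following is only an illustrative Python sketch; the field names are not standard notation:

from typing import NamedTuple, Tuple

class State(NamedTuple):
    head: str               # X, the left-hand side of the production
    body: Tuple[str, ...]   # α β, the right-hand side of the production
    dot: int                # how many right-hand-side symbols are already matched
    origin: int             # i, the input position where this match began

# (S → S • + M, 0): matching S → S + M from input position 0,
# with the dot after the first right-hand-side symbol.
example = State("S", ("S", "+", "M"), 1, 0)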

(Earley's original algorithm included a look-ahead in the state; later research showed this to have little practical effect on the parsing efficiency, and it has subsequently been dropped from most implementations.)

A state is finished when its current position is the last position of the right side of the production, that is, when there is no symbol to the right of the dot • in the visual representation of the state.

The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule. The parser then repeatedly executes three operations: prediction, scanning, and completion.

  - Prediction: for every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).
  - Scanning: if a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k+1).
  - Completion: for every state in S(k) of the form (Y → γ •, j), find all states in S(j) of the form (X → α • Y β, i) and add (X → α Y • β, i) to S(k).

Duplicate states are not added to the state set, only new ones. These three operations are repeated until no new states can be added to the set. The set is generally implemented as a queue of states to process, with the operation to be performed depending on what kind of state it is.

The algorithm accepts if (X → γ •, 0) ends up in S(n), where (X → γ) is the top-level rule and n the input length; otherwise it rejects.

Pseudocode

Adapted from Speech and Language Processing[5] by Daniel Jurafsky and James H. Martin:

DECLARE ARRAY S;

function INIT(words)
    S ← CREATE_ARRAY(LENGTH(words) + 1)
    for k from 0 to LENGTH(words) do
        S[k] ← EMPTY_ORDERED_SET

function EARLEY_PARSE(words, grammar)
    INIT(words)
    ADD_TO_SET((γ → •S, 0), S[0])
    for k from 0 to LENGTH(words) do
        for each state in S[k] do  // S[k] can expand during this loop
            if not FINISHED(state) then
                if NEXT_ELEMENT_OF(state) is a nonterminal then
                    PREDICTOR(state, k, grammar)  // non_terminal
                else do
                    SCANNER(state, k, words)      // terminal
            else do
                COMPLETER(state, k)
        end
    end
    return S

procedure PREDICTOR((A → α•Bβ, j), k, grammar)
    for each (B → γ) in GRAMMAR_RULES_FOR(B, grammar) do
        ADD_TO_SET((B → •γ, k), S[k])
    end

procedure SCANNER((A → α•aβ, j), k, words)
    if k < LENGTH(words) and a ⊂ PARTS_OF_SPEECH(words[k]) then
        ADD_TO_SET((A → αa•β, j), S[k+1])
    end

procedure COMPLETER((B → γ•, x), k)
    for each (A → α•Bβ, j) in S[x] do
        ADD_TO_SET((A → αB•β, j), S[k])
    end
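The pseudocode above translates almost line for line into a runnable program. The following is a minimal Python sketch of the recogniser, not code from the sources above: the name earley_recognise and the dict-based grammar encoding are illustrative assumptions, terminals are matched by simple equality rather than through a PARTS_OF_SPEECH lookup, and, being a direct transcription, the sketch inherits the nullable-grammar caveat mentioned in the introduction.

from collections import namedtuple

# A state: left-hand side, right-hand side, dot position, origin position.
State = namedtuple("State", ["head", "body", "dot", "origin"])

def next_symbol(state):
    """The symbol after the dot, or None if the state is finished."""
    return state.body[state.dot] if state.dot < len(state.body) else None

def earley_recognise(words, grammar, start):
    """grammar maps each nonterminal to a list of right-hand sides."""
    n = len(words)
    S = [[] for _ in range(n + 1)]          # ordered state sets, used as queues
    members = [set() for _ in range(n + 1)]

    def add(state, k):                      # duplicate states are not added
        if state not in members[k]:
            members[k].add(state)
            S[k].append(state)

    add(State("GAMMA", (start,), 0, 0), 0)  # seed S(0) with the top-level rule
    for k in range(n + 1):
        i = 0
        while i < len(S[k]):                # S[k] can grow during this loop
            st = S[k][i]
            i += 1
            sym = next_symbol(st)
            if sym is None:                 # completion
                for parent in list(S[st.origin]):
                    if next_symbol(parent) == st.head:
                        add(parent._replace(dot=parent.dot + 1), k)
            elif sym in grammar:            # prediction
                for rhs in grammar[sym]:
                    add(State(sym, tuple(rhs), 0, k), k)
            elif k < n and words[k] == sym: # scanning
                add(st._replace(dot=st.dot + 1), k + 1)
    # Accept iff (GAMMA → start •, 0) is in S(n).
    return State("GAMMA", (start,), 1, 0) in members[n]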

Example

Consider the following simple grammar for arithmetic expressions:

<P> ::= <S>      # the start rule
<S> ::= <S> "+" <M> | <M>
<M> ::= <M> "*" <T> | <T>
<T> ::= "1" | "2" | "3" | "4"

With the input:

2 + 3 * 4

This is the sequence of state sets:

(state no.)  Production     (origin)  Comment

S(0): • 2 + 3 * 4
  1          P → • S        0         start rule
  2          S → • S + M    0         predict from (1)
  3          S → • M        0         predict from (1)
  4          M → • M * T    0         predict from (3)
  5          M → • T        0         predict from (3)
  6          T → • number   0         predict from (5)

S(1): 2 • + 3 * 4
  1          T → number •   0         scan from S(0)(6)
  2          M → T •        0         complete from (1) and S(0)(5)
  3          M → M • * T    0         complete from (2) and S(0)(4)
  4          S → M •        0         complete from (2) and S(0)(3)
  5          S → S • + M    0         complete from (4) and S(0)(2)
  6          P → S •        0         complete from (4) and S(0)(1)

S(2): 2 + • 3 * 4
  1          S → S + • M    0         scan from S(1)(5)
  2          M → • M * T    2         predict from (1)
  3          M → • T        2         predict from (1)
  4          T → • number   2         predict from (3)

S(3): 2 + 3 • * 4
  1          T → number •   2         scan from S(2)(4)
  2          M → T •        2         complete from (1) and S(2)(3)
  3          M → M • * T    2         complete from (2) and S(2)(2)
  4          S → S + M •    0         complete from (2) and S(2)(1)
  5          S → S • + M    0         complete from (4) and S(0)(2)
  6          P → S •        0         complete from (4) and S(0)(1)

S(4): 2 + 3 * • 4
  1          M → M * • T    2         scan from S(3)(3)
  2          T → • number   4         predict from (1)

S(5): 2 + 3 * 4 •
  1          T → number •   4         scan from S(4)(2)
  2          M → M * T •    2         complete from (1) and S(4)(1)
  3          M → M • * T    2         complete from (2) and S(2)(2)
  4          S → S + M •    0         complete from (2) and S(2)(1)
  5          S → S • + M    0         complete from (4) and S(0)(2)
  6          P → S •        0         complete from (4) and S(0)(1)

The state (P → S •, 0) in S(5) represents a completed parse of the full input. This state also appears in S(1) and S(3), since the prefixes "2" and "2 + 3" are themselves complete sentences of the grammar.
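As a usage illustration, this grammar and input can be fed to the earley_recognise sketch given after the pseudocode; the dict encoding below is an assumption of that sketch, with the number alternatives spelled out as in the grammar above:

grammar = {
    "P": [["S"]],                       # the start rule
    "S": [["S", "+", "M"], ["M"]],
    "M": [["M", "*", "T"], ["T"]],
    "T": [["1"], ["2"], ["3"], ["4"]],
}
# Prints True: (P → S •, 0) is reached in S(5), matching the table above
# (the sketch's internal GAMMA → P • state signals acceptance).
print(earley_recognise(["2", "+", "3", "*", "4"], grammar, "P"))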

Constructing the parse forest

Earley's dissertation[6] briefly describes an algorithm for constructing parse trees by adding a set of pointers from each non-terminal in an Earley item back to the items that caused it to be recognized. But Tomita noticed[7] that this does not take into account the relations between symbols, so if we consider the grammar S → SS | b and the string bbb, it only notes that each S can match one or two b's, and thus produces spurious derivations for bb and bbbb as well as the two correct derivations for bbb.

Another method[8] is to build the parse forest as you go, augmenting each Earley item with a pointer to a shared packed parse forest (SPPF) node labelled with a triple (s, i, j) where s is a symbol or an LR(0) item (production rule with dot), and i and j give the section of the input string derived by this node. A node's contents are either a pair of child pointers giving a single derivation, or a list of "packed" nodes each containing a pair of pointers and representing one derivation. SPPF nodes are unique (there is only one with a given label), but may contain more than one derivation for ambiguous parses. So even if an operation does not add an Earley item (because it already exists), it may still add a derivation to the item's parse forest.

SPPF nodes are never labelled with a completed LR(0) item: instead they are labelled with the symbol that is produced, so that all derivations are combined under one node regardless of which alternative production they come from.
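A rough Python sketch of this structure follows; the class and field names are invented for illustration, and most of the bookkeeping specified in Scott's paper[8] is omitted:

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

@dataclass(frozen=True)
class Item:
    """An LR(0) item: a production together with a dot position."""
    head: str
    body: Tuple[str, ...]
    dot: int

# A label is a symbol or an (uncompleted) LR(0) item, plus the extent (i, j)
# of the input substring this node derives.
Label = Tuple[Union[str, Item], int, int]

@dataclass
class SPPFNode:
    label: Label
    # Each entry is one packed derivation: a pair of child pointers
    # (a pointer may be None near the left end of a production).
    derivations: List[Tuple[Optional["SPPFNode"], Optional["SPPFNode"]]] = field(default_factory=list)

nodes: Dict[Label, SPPFNode] = {}       # one node per label

def get_node(label: Label) -> SPPFNode:
    if label not in nodes:
        nodes[label] = SPPFNode(label)
    return nodes[label]

def add_derivation(node: SPPFNode, left: Optional[SPPFNode], right: Optional[SPPFNode]) -> None:
    # Even when an Earley item already exists, a new derivation may still
    # be packed into its node, recording an ambiguity.
    if not any(l is left and r is right for (l, r) in node.derivations):
        node.derivations.append((left, right))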

Optimizations

Philippe McLean and R. Nigel Horspool, in their paper "A Faster Earley Parser", combine Earley parsing with LR parsing and achieve an order-of-magnitude speedup.


Citations

  1. Kegler, Jeffrey. "What is the Marpa algorithm?". Retrieved 20 August 2013.
  2. Earley, Jay (1968). An Efficient Context-Free Parsing Algorithm (PDF). Carnegie-Mellon Dissertation. Archived from the original (PDF) on 2017-09-22. Retrieved 2012-09-12.
  3. Earley, Jay (1970). "An efficient context-free parsing algorithm" (PDF). Communications of the ACM. 13 (2): 94–102. doi:10.1145/362007.362035. S2CID 47032707. Archived from the original (PDF) on 2004-07-08.
  4. Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley. p. 145. ISBN 978-0-201-02988-8.
  5. Jurafsky, D. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall. ISBN 9780131873216.
  6. Earley, Jay (1968). An Efficient Context-Free Parsing Algorithm (PDF). Carnegie-Mellon Dissertation. p. 106. Archived from the original (PDF) on 2017-09-22. Retrieved 2012-09-12.
  7. Tomita, Masaru (April 17, 2013). Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Springer Science and Business Media. p. 74. ISBN 978-1475718850. Retrieved 16 September 2015.
  8. Scott, Elizabeth (April 1, 2008). "SPPF-Style Parsing From Earley Recognizers". Electronic Notes in Theoretical Computer Science. 203 (2): 53–67. doi:10.1016/j.entcs.2008.03.044.


Implementations

Implementations of the Earley algorithm are available in many languages, including C, C++, Haskell, Java, C#, JavaScript, OCaml, Perl, Python, Rust, Common Lisp, Scheme, Racket, and Wolfram.

