Earley parser

Class: Parsing grammars that are context-free
Data structure: String
Worst-case performance: O(n³)
Best-case performance: O(n)
Average performance: O(n²) for unambiguous grammars

In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though (depending on the variant) it may suffer problems with certain nullable grammars.[1] The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation[2] in 1968 (and later appeared in an abbreviated, more legible form in a journal[3]).


Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages. The Earley parser executes in cubic time, O(n³), in the general case, where n is the length of the parsed string; in quadratic time, O(n²), for unambiguous grammars;[4] and in linear time for all deterministic context-free grammars. It performs particularly well when the rules are written left-recursively.

Earley recogniser

The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser.

The algorithm

In the following descriptions, α, β, and γ represent any string of terminals/nonterminals (including the empty string), X and Y represent single nonterminals, and a represents a terminal symbol.

Earley's algorithm is a top-down dynamic programming algorithm. In the following, we use Earley's dot notation: given a production X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.

Input position 0 is the position prior to input. Input position n is the position after accepting the nth token. (Informally, input positions can be thought of as locations at token boundaries.) For every input position, the parser generates a state set. Each state is a tuple (X → α • β, i), consisting of

  - the production currently being matched (X → α β)
  - the current position in that production (visually represented by the dot •)
  - the position i in the input at which the matching of this production began: the origin position
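For concreteness, such a state can be modelled as a small record. The following is only an illustrative Python sketch; the field names are not standard notation:

from typing import NamedTuple, Tuple

class State(NamedTuple):
    head: str               # X, the left-hand side of the production
    body: Tuple[str, ...]   # α β, the right-hand side of the production
    dot: int                # how many right-hand-side symbols are already matched
    origin: int             # i, the input position where this match began

# (S → S • + M, 0): matching S → S + M from input position 0,
# with the dot after the first right-hand-side symbol.
example = State("S", ("S", "+", "M"), 1, 0)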

(Earley's original algorithm included a look-ahead in the state; later research showed this to have little practical effect on the parsing efficiency, and it has subsequently been dropped from most implementations.)

A state is finished when its current position is the last position of the right side of the production, that is, when there is no symbol to the right of the dot • in the visual representation of the state.

The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule. The parser then repeatedly executes three operations: prediction, scanning, and completion.

  - Prediction: for every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).
  - Scanning: if a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k+1).
  - Completion: for every state in S(k) of the form (Y → γ •, j), find all states in S(j) of the form (X → α • Y β, i) and add (X → α Y • β, i) to S(k).

Duplicate states are not added to the state set, only new ones. These three operations are repeated until no new states can be added to the set. The set is generally implemented as a queue of states to process, with the operation to be performed depending on what kind of state it is.

The algorithm accepts if (X → γ •, 0) ends up in S(n), where (X → γ) is the top-level rule and n the input length; otherwise it rejects.

Pseudocode

Adapted from Speech and Language Processing[5] by Daniel Jurafsky and James H. Martin:

DECLARE ARRAY S;

function INIT(words)
    S ← CREATE_ARRAY(LENGTH(words) + 1)
    for k from 0 to LENGTH(words) do
        S[k] ← EMPTY_ORDERED_SET

function EARLEY_PARSE(words, grammar)
    INIT(words)
    ADD_TO_SET((γ → •S, 0), S[0])
    for k from 0 to LENGTH(words) do
        for each state in S[k] do  // S[k] can expand during this loop
            if not FINISHED(state) then
                if NEXT_ELEMENT_OF(state) is a nonterminal then
                    PREDICTOR(state, k, grammar)  // non_terminal
                else do
                    SCANNER(state, k, words)      // terminal
            else do
                COMPLETER(state, k)
        end
    end
    return S

procedure PREDICTOR((A → α•Bβ, j), k, grammar)
    for each (B → γ) in GRAMMAR_RULES_FOR(B, grammar) do
        ADD_TO_SET((B → •γ, k), S[k])
    end

procedure SCANNER((A → α•aβ, j), k, words)
    if k < LENGTH(words) and a ⊂ PARTS_OF_SPEECH(words[k]) then
        ADD_TO_SET((A → αa•β, j), S[k+1])
    end

procedure COMPLETER((B → γ•, x), k)
    for each (A → α•Bβ, j) in S[x] do
        ADD_TO_SET((A → αB•β, j), S[k])
    end
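The pseudocode above translates almost line for line into a runnable program. The following is a minimal Python sketch of the recogniser, not code from the sources above: the name earley_recognise and the dict-based grammar encoding are illustrative assumptions, terminals are matched by simple equality rather than through a PARTS_OF_SPEECH lookup, and, being a direct transcription, the sketch inherits the nullable-grammar caveat mentioned in the introduction.

from collections import namedtuple

# A state: left-hand side, right-hand side, dot position, origin position.
State = namedtuple("State", ["head", "body", "dot", "origin"])

def next_symbol(state):
    """The symbol after the dot, or None if the state is finished."""
    return state.body[state.dot] if state.dot < len(state.body) else None

def earley_recognise(words, grammar, start):
    """grammar maps each nonterminal to a list of right-hand sides."""
    n = len(words)
    S = [[] for _ in range(n + 1)]          # ordered state sets, used as queues
    members = [set() for _ in range(n + 1)]

    def add(state, k):                      # duplicate states are not added
        if state not in members[k]:
            members[k].add(state)
            S[k].append(state)

    add(State("GAMMA", (start,), 0, 0), 0)  # seed S(0) with the top-level rule
    for k in range(n + 1):
        i = 0
        while i < len(S[k]):                # S[k] can grow during this loop
            st = S[k][i]
            i += 1
            sym = next_symbol(st)
            if sym is None:                 # completion
                for parent in list(S[st.origin]):
                    if next_symbol(parent) == st.head:
                        add(parent._replace(dot=parent.dot + 1), k)
            elif sym in grammar:            # prediction
                for rhs in grammar[sym]:
                    add(State(sym, tuple(rhs), 0, k), k)
            elif k < n and words[k] == sym: # scanning
                add(st._replace(dot=st.dot + 1), k + 1)
    # Accept iff (GAMMA → start •, 0) is in S(n).
    return State("GAMMA", (start,), 1, 0) in members[n]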

Example

Consider the following simple grammar for arithmetic expressions:

<P> ::= <S>      # the start rule
<S> ::= <S> "+" <M> | <M>
<M> ::= <M> "*" <T> | <T>
<T> ::= "1" | "2" | "3" | "4"

With the input:

2 + 3 * 4

This is the sequence of state sets:

(state no.)  Production     (origin)  Comment

S(0): • 2 + 3 * 4
  1          P → • S        0         start rule
  2          S → • S + M    0         predict from (1)
  3          S → • M        0         predict from (1)
  4          M → • M * T    0         predict from (3)
  5          M → • T        0         predict from (3)
  6          T → • number   0         predict from (5)

S(1): 2 • + 3 * 4
  1          T → number •   0         scan from S(0)(6)
  2          M → T •        0         complete from (1) and S(0)(5)
  3          M → M • * T    0         complete from (2) and S(0)(4)
  4          S → M •        0         complete from (2) and S(0)(3)
  5          S → S • + M    0         complete from (4) and S(0)(2)
  6          P → S •        0         complete from (4) and S(0)(1)

S(2): 2 + • 3 * 4
  1          S → S + • M    0         scan from S(1)(5)
  2          M → • M * T    2         predict from (1)
  3          M → • T        2         predict from (1)
  4          T → • number   2         predict from (3)

S(3): 2 + 3 • * 4
  1          T → number •   2         scan from S(2)(4)
  2          M → T •        2         complete from (1) and S(2)(3)
  3          M → M • * T    2         complete from (2) and S(2)(2)
  4          S → S + M •    0         complete from (2) and S(2)(1)
  5          S → S • + M    0         complete from (4) and S(0)(2)
  6          P → S •        0         complete from (4) and S(0)(1)

S(4): 2 + 3 * • 4
  1          M → M * • T    2         scan from S(3)(3)
  2          T → • number   4         predict from (1)

S(5): 2 + 3 * 4 •
  1          T → number •   4         scan from S(4)(2)
  2          M → M * T •    2         complete from (1) and S(4)(1)
  3          M → M • * T    2         complete from (2) and S(2)(2)
  4          S → S + M •    0         complete from (2) and S(2)(1)
  5          S → S • + M    0         complete from (4) and S(0)(2)
  6          P → S •        0         complete from (4) and S(0)(1)

The state (P → S •, 0) in S(5) represents a completed parse of the full input. This state also appears in S(1) and S(3), since the prefixes "2" and "2 + 3" are themselves complete sentences of the grammar.
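As a usage illustration, this grammar and input can be fed to the earley_recognise sketch given after the pseudocode; the dict encoding below is an assumption of that sketch, with the number alternatives spelled out as in the grammar above:

grammar = {
    "P": [["S"]],                       # the start rule
    "S": [["S", "+", "M"], ["M"]],
    "M": [["M", "*", "T"], ["T"]],
    "T": [["1"], ["2"], ["3"], ["4"]],
}
# Prints True: (P → S •, 0) is reached in S(5), matching the table above
# (the sketch's internal GAMMA → P • state signals acceptance).
print(earley_recognise(["2", "+", "3", "*", "4"], grammar, "P"))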

Constructing the parse forest

Earley's dissertation[6] briefly describes an algorithm for constructing parse trees by adding a set of pointers from each non-terminal in an Earley item back to the items that caused it to be recognized. But Tomita noticed[7] that this does not take into account the relations between symbols, so if we consider the grammar S → SS | b and the string bbb, it only notes that each S can match one or two b's, and thus produces spurious derivations for bb and bbbb as well as the two correct derivations for bbb.

Another method[8] is to build the parse forest as you go, augmenting each Earley item with a pointer to a shared packed parse forest (SPPF) node labelled with a triple (s, i, j) where s is a symbol or an LR(0) item (production rule with dot), and i and j give the section of the input string derived by this node. A node's contents are either a pair of child pointers giving a single derivation, or a list of "packed" nodes each containing a pair of pointers and representing one derivation. SPPF nodes are unique (there is only one with a given label), but may contain more than one derivation for ambiguous parses. So even if an operation does not add an Earley item (because it already exists), it may still add a derivation to the item's parse forest.

SPPF nodes are never labelled with a completed LR(0) item: instead they are labelled with the symbol that is produced, so that all derivations are combined under one node regardless of which alternative production they come from.
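A rough Python sketch of this structure follows; the class and field names are invented for illustration, and most of the bookkeeping specified in Scott's paper[8] is omitted:

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

@dataclass(frozen=True)
class Item:
    """An LR(0) item: a production together with a dot position."""
    head: str
    body: Tuple[str, ...]
    dot: int

# A label is a symbol or an (uncompleted) LR(0) item, plus the extent (i, j)
# of the input substring this node derives.
Label = Tuple[Union[str, Item], int, int]

@dataclass
class SPPFNode:
    label: Label
    # Each entry is one packed derivation: a pair of child pointers
    # (a pointer may be None near the left end of a production).
    derivations: List[Tuple[Optional["SPPFNode"], Optional["SPPFNode"]]] = field(default_factory=list)

nodes: Dict[Label, SPPFNode] = {}       # one node per label

def get_node(label: Label) -> SPPFNode:
    if label not in nodes:
        nodes[label] = SPPFNode(label)
    return nodes[label]

def add_derivation(node: SPPFNode, left: Optional[SPPFNode], right: Optional[SPPFNode]) -> None:
    # Even when an Earley item already exists, a new derivation may still
    # be packed into its node, recording an ambiguity.
    if not any(l is left and r is right for (l, r) in node.derivations):
        node.derivations.append((left, right))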

Optimizations

Philippe McLean and R. Nigel Horspool, in their paper "A Faster Earley Parser", combine Earley parsing with LR parsing and achieve an order-of-magnitude speedup.


Citations

  1. Kegler, Jeffrey. "What is the Marpa algorithm?". Retrieved 20 August 2013.
  2. Earley, Jay (1968). An Efficient Context-Free Parsing Algorithm (PDF). Carnegie-Mellon Dissertation. Archived from the original (PDF) on 2017-09-22. Retrieved 2012-09-12.
  3. Earley, Jay (1970). "An efficient context-free parsing algorithm" (PDF). Communications of the ACM. 13 (2): 94–102. doi:10.1145/362007.362035. S2CID 47032707. Archived from the original (PDF) on 2004-07-08.
  4. Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley. p. 145. ISBN 978-0-201-02988-8.
  5. Jurafsky, D. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall. ISBN 9780131873216.
  6. Earley, Jay (1968). An Efficient Context-Free Parsing Algorithm (PDF). Carnegie-Mellon Dissertation. p. 106. Archived from the original (PDF) on 2017-09-22. Retrieved 2012-09-12.
  7. Tomita, Masaru (April 17, 2013). Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Springer Science and Business Media. p. 74. ISBN 978-1475718850. Retrieved 16 September 2015.
  8. Scott, Elizabeth (April 1, 2008). "SPPF-Style Parsing From Earley Recognizers". Electronic Notes in Theoretical Computer Science. 203 (2): 53–67. doi:10.1016/j.entcs.2008.03.044.


Implementations

Implementations of the Earley algorithm are available in many languages, including C, C++, Haskell, Java, C#, JavaScript, OCaml, Perl, Python, Rust, Common Lisp, Scheme, Racket, and Wolfram.

