Class | Parsing grammars that are context-free |
---|---|
Data structure | String |
Worst-case performance | O(n³) |
Best-case performance | Ω(n) |
Average performance | |
In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though (depending on the variant) it may have problems with certain nullable grammars. [1] The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation [2] in 1968 and later appeared in an abbreviated, more legible form in a journal. [3]
Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but can only handle restricted classes of languages. The Earley parser executes in cubic time, O(n³), in the general case, where n is the length of the parsed string; in quadratic time, O(n²), for unambiguous grammars; [4] and in linear time for all deterministic context-free grammars. It performs particularly well when the rules are written left-recursively.
The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser.
In the following descriptions, α, β, and γ represent any string of terminals/nonterminals (including the empty string), X and Y represent single nonterminals, and a represents a terminal symbol.
Earley's algorithm is a top-down dynamic programming algorithm. In the following, we use Earley's dot notation: given a production X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.
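Input position 0 is the position prior to input. Input position n is the position after accepting the nth token. (Informally, input positions can be thought of as locations at token boundaries.) For every input position, the parser generates a state set. Each state is a tuple (X → α • β, i), consisting of the production currently being matched (X → αβ), the current position in that production (represented by the dot •), and the position i in the input at which the matching of this production began, called the origin position.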
(Earley's original algorithm included a look-ahead in the state; later research showed this to have little practical effect on the parsing efficiency, and it has subsequently been dropped from most implementations.)
A state is finished when its current position is the last position of the right side of the production, that is, when there is no symbol to the right of the dot • in the visual representation of the state.
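A state as described above can be sketched as a small immutable record. This is only an illustration: the class name `State` and its field and method names are assumptions made here, not part of Earley's formulation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    """An Earley state (X → α • β, i). All names here are illustrative."""
    lhs: str               # X, the left-hand side of the production
    rhs: Tuple[str, ...]   # the full right-hand side αβ
    dot: int               # how many right-hand-side symbols are already parsed
    origin: int            # i, the input position where matching began

    def finished(self) -> bool:
        # Finished means no symbol remains to the right of the dot.
        return self.dot == len(self.rhs)

    def next_symbol(self) -> Optional[str]:
        # The symbol immediately after the dot, or None when finished.
        return None if self.finished() else self.rhs[self.dot]

    def advance(self) -> "State":
        # Move the dot one symbol to the right; the origin is unchanged.
        return State(self.lhs, self.rhs, self.dot + 1, self.origin)

    def __str__(self) -> str:
        symbols = list(self.rhs)
        symbols.insert(self.dot, "•")
        return f"({self.lhs} → {' '.join(symbols)}, {self.origin})"
```

Making the record frozen (and therefore hashable) is a design choice that lets states be stored in sets or used as dictionary keys, which makes duplicate detection straightforward.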
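The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule. The parser then repeatedly executes three operations: prediction, scanning, and completion.
Prediction: for every state in S(k) of the form (X → α • Y β, j), add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).
Scanning: if a is the next symbol in the input stream, then for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k+1).
Completion: for every state in S(k) of the form (Y → γ •, j), find all states in S(j) of the form (X → α • Y β, i) and add (X → α Y • β, i) to S(k).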
Duplicate states are not added to the state set, only new ones. These three operations are repeated until no new states can be added to the set. The set is generally implemented as a queue of states to process, with the operation to be performed depending on what kind of state it is.
The algorithm accepts if (X → γ •, 0) ends up in S(n), where (X → γ) is the top-level rule and n is the input length; otherwise it rejects.
Adapted from Speech and Language Processing [5] by Daniel Jurafsky and James H. Martin:

```
DECLARE ARRAY S;

function INIT(words)
    S ← CREATE_ARRAY(LENGTH(words) + 1)
    for k ← from 0 to LENGTH(words) do
        S[k] ← EMPTY_ORDERED_SET

function EARLEY_PARSE(words, grammar)
    INIT(words)
    ADD_TO_SET((γ → •S, 0), S[0])
    for k ← from 0 to LENGTH(words) do
        for each state in S[k] do  // S[k] can expand during this loop
            if not FINISHED(state) then
                if NEXT_ELEMENT_OF(state) is a nonterminal then
                    PREDICTOR(state, k, grammar)  // non_terminal
                else do
                    SCANNER(state, k, words)      // terminal
            else do
                COMPLETER(state, k)
        end
    end
    return S  // the array of state sets is the chart

procedure PREDICTOR((A → α•Bβ, j), k, grammar)
    for each (B → γ) in GRAMMAR_RULES_FOR(B, grammar) do
        ADD_TO_SET((B → •γ, k), S[k])
    end

procedure SCANNER((A → α•aβ, j), k, words)
    // k, not j, indexes the next word, so k is what must be in range
    if k < LENGTH(words) and a ∈ PARTS_OF_SPEECH(words[k]) then
        ADD_TO_SET((A → αa•β, j), S[k+1])
    end

procedure COMPLETER((B → γ•, x), k)
    for each (A → α•Bβ, j) in S[x] do
        ADD_TO_SET((A → αB•β, j), S[k])
    end
```
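The pseudocode above can be sketched in Python roughly as follows. This is a minimal recogniser under stated assumptions, not a reference implementation: the names `earley_chart`, `recognises`, and `GRAMMAR` are hypothetical, terminals are matched by plain string equality rather than through a `PARTS_OF_SPEECH` lookup, and nullable rules get no special treatment. The small arithmetic grammar is included only to make the example runnable.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[str, Tuple[str, ...]]              # (left-hand side, right-hand side)
State = Tuple[str, Tuple[str, ...], int, int]   # (lhs, rhs, dot index, origin)


def earley_chart(words: Sequence[str], grammar: List[Rule], start: str) -> List[List[State]]:
    """Build the state sets S[0..n] for the given token sequence."""
    nonterminals = {lhs for lhs, _ in grammar}
    chart: List[List[State]] = [[] for _ in range(len(words) + 1)]

    def add(state: State, k: int) -> None:
        if state not in chart[k]:          # duplicate states are never added
            chart[k].append(state)

    add(("γ", (start,), 0, 0), 0)          # seed: (γ → • start, 0)
    for k in range(len(words) + 1):
        i = 0
        while i < len(chart[k]):           # chart[k] can grow during this loop
            lhs, rhs, dot, origin = chart[k][i]
            i += 1
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in nonterminals:                    # prediction
                    for rule_lhs, rule_rhs in grammar:
                        if rule_lhs == symbol:
                            add((rule_lhs, rule_rhs, 0, k), k)
                elif k < len(words) and words[k] == symbol:   # scanning
                    add((lhs, rhs, dot + 1, origin), k + 1)
            else:                                             # completion
                for p_lhs, p_rhs, p_dot, p_origin in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        add((p_lhs, p_rhs, p_dot + 1, p_origin), k)
    return chart


def recognises(words: Sequence[str], grammar: List[Rule], start: str) -> bool:
    """Accept if the finished seed state (γ → start •, 0) is in the last set."""
    chart = earley_chart(words, grammar, start)
    return ("γ", (start,), 1, 0) in chart[len(words)]


# Hypothetical demo grammar for sums and products of the digits 1 to 4.
GRAMMAR: List[Rule] = [
    ("S", ("S", "+", "M")),
    ("S", ("M",)),
    ("M", ("M", "*", "T")),
    ("M", ("T",)),
    ("T", ("1",)), ("T", ("2",)), ("T", ("3",)), ("T", ("4",)),
]

if __name__ == "__main__":
    print(recognises("2 + 3 * 4".split(), GRAMMAR, "S"))
    print(recognises("2 +".split(), GRAMMAR, "S"))
```

Each state set is kept as a list so that it can be scanned by index while it grows; the linear duplicate check keeps the sketch short but would normally be replaced by a set lookup in a serious implementation.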
Consider the following simple grammar for arithmetic expressions:
```
<P> ::= <S>      # the start rule
<S> ::= <S> "+" <M> | <M>
<M> ::= <M> "*" <T> | <T>
<T> ::= "1" | "2" | "3" | "4"
```
With the input:
2 + 3 * 4
This is the sequence of state sets:
(state no.) | Production | (Origin) | Comment |
---|---|---|---|
S(0): • 2 + 3 * 4 | |||
1 | P → • S | 0 | start rule |
2 | S → • S + M | 0 | predict from (1) |
3 | S → • M | 0 | predict from (1) |
4 | M → • M * T | 0 | predict from (3) |
5 | M → • T | 0 | predict from (3) |
6 | T → • number | 0 | predict from (5) |
S(1): 2 • + 3 * 4 | |||
1 | T → number • | 0 | scan from S(0)(6) |
2 | M → T • | 0 | complete from (1) and S(0)(5) |
3 | M → M • * T | 0 | complete from (2) and S(0)(4) |
4 | S → M • | 0 | complete from (2) and S(0)(3) |
5 | S → S • + M | 0 | complete from (4) and S(0)(2) |
6 | P → S • | 0 | complete from (4) and S(0)(1) |
S(2): 2 + • 3 * 4 | |||
1 | S → S + • M | 0 | scan from S(1)(5) |
2 | M → • M * T | 2 | predict from (1) |
3 | M → • T | 2 | predict from (1) |
4 | T → • number | 2 | predict from (3) |
S(3): 2 + 3 • * 4 | |||
1 | T → number • | 2 | scan from S(2)(4) |
2 | M → T • | 2 | complete from (1) and S(2)(3) |
3 | M → M • * T | 2 | complete from (2) and S(2)(2) |
4 | S → S + M • | 0 | complete from (2) and S(2)(1) |
5 | S → S • + M | 0 | complete from (4) and S(0)(2) |
6 | P → S • | 0 | complete from (4) and S(0)(1) |
S(4): 2 + 3 * • 4 | |||
1 | M → M * • T | 2 | scan from S(3)(3) |
2 | T → • number | 4 | predict from (1) |
S(5): 2 + 3 * 4 • | |||
1 | T → number • | 4 | scan from S(4)(2) |
2 | M → M * T • | 2 | complete from (1) and S(4)(1) |
3 | M → M • * T | 2 | complete from (2) and S(2)(2) |
4 | S → S + M • | 0 | complete from (2) and S(2)(1) |
5 | S → S • + M | 0 | complete from (4) and S(0)(2) |
6 | P → S • | 0 | complete from (4) and S(0)(1) |
The state (P → S •, 0) in S(5) represents a completed parse. This state also appears in S(3) and S(1), because the prefixes "2 + 3" and "2" are themselves complete sentences.
Earley's dissertation [6] briefly describes an algorithm for constructing parse trees by adding a set of pointers from each non-terminal in an Earley item back to the items that caused it to be recognized. But Tomita noticed [7] that this does not take into account the relations between symbols, so if we consider the grammar S → SS | b and the string bbb, it only notes that each S can match one or two b's, and thus produces spurious derivations for bb and bbbb as well as the two correct derivations for bbb.
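The number of derivations that a correct forest must contain for this grammar can be checked independently of any parser. The sketch below (the function name `count_derivations` is an assumption made for illustration) counts the distinct parse trees of a string of n b's under S → SS | b by trying every split point for the rule S → SS.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_derivations(n: int) -> int:
    """Number of distinct parse trees of b^n under the grammar S → SS | b."""
    if n == 1:
        return 1  # the only tree is S → b
    # S → SS: the left S covers the first i b's, the right S covers the rest.
    return sum(count_derivations(i) * count_derivations(n - i) for i in range(1, n))


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, count_derivations(n))
```

For bbb the count is 2, matching the two correct derivations mentioned above; a tree-building scheme that also yields derivations of bb or bbbb from the same input is therefore over-generating.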
Another method [8] is to build the parse forest as you go, augmenting each Earley item with a pointer to a shared packed parse forest (SPPF) node labelled with a triple (s, i, j) where s is a symbol or an LR(0) item (production rule with dot), and i and j give the section of the input string derived by this node. A node's contents are either a pair of child pointers giving a single derivation, or a list of "packed" nodes each containing a pair of pointers and representing one derivation. SPPF nodes are unique (there is only one with a given label), but may contain more than one derivation for ambiguous parses. So even if an operation does not add an Earley item (because it already exists), it may still add a derivation to the item's parse forest.
SPPF nodes are never labelled with a completed LR(0) item: instead they are labelled with the symbol that is produced, so that all derivations are combined under one node regardless of which alternative production they come from.
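The uniqueness and packing properties described above can be illustrated with a small data-structure sketch. The class and method names (`SPPFNode`, `Forest`, `add_family`) are assumptions made for this example, and as a simplification every node stores a list of child pairs even when it has only one derivation.

```python
from typing import Dict, List, Optional, Tuple

Label = Tuple[str, int, int]  # (s, i, j): symbol or dotted item, start, end


class SPPFNode:
    """A forest node; each stored family is one derivation of the node."""

    def __init__(self, label: Label) -> None:
        self.label = label
        self.families: List[Tuple[Optional["SPPFNode"], Optional["SPPFNode"]]] = []

    def add_family(self, left: Optional["SPPFNode"], right: Optional["SPPFNode"]) -> bool:
        # Record a derivation as a pair of child pointers; ignore repeats.
        family = (left, right)
        if family in self.families:
            return False
        self.families.append(family)
        return True

    def is_ambiguous(self) -> bool:
        return len(self.families) > 1


class Forest:
    """Keeps SPPF nodes unique: at most one node per label."""

    def __init__(self) -> None:
        self._nodes: Dict[Label, SPPFNode] = {}

    def node(self, s: str, i: int, j: int) -> SPPFNode:
        label = (s, i, j)
        if label not in self._nodes:
            self._nodes[label] = SPPFNode(label)
        return self._nodes[label]


if __name__ == "__main__":
    # The two derivations of bbb under S → SS | b share one top node.
    forest = Forest()
    top = forest.node("S", 0, 3)
    top.add_family(forest.node("S", 0, 1), forest.node("S", 1, 3))
    top.add_family(forest.node("S", 0, 2), forest.node("S", 2, 3))
    print(top.label, len(top.families), top.is_ambiguous())
```

Because `add_family` can succeed on a node that already exists, a parser step that adds no new Earley item can still contribute a new derivation to the forest.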
In their paper "A Faster Earley Parser", Philippe McLean and R. Nigel Horspool combine Earley parsing with LR parsing and achieve a speed improvement of an order of magnitude.