LL grammar

Last updated
The C grammar is not LL(1): The bottom part shows a parser that has digested the tokens "int v;main(){" and is about to choose a rule to derive the nonterminal "Stmt". Looking only at the first lookahead token "v", it cannot decide which of both alternatives for "Stmt" to choose, since two input continuations are possible. They can be discriminated by peeking at the second lookahead token (yellow background). Parsing a C program that needs 2 token lookahead.svg
The C grammar is not LL(1): The bottom part shows a parser that has digested the tokens "int v;main(){" and is about to choose a rule to derive the nonterminal "Stmt". Looking only at the first lookahead token "v", it cannot decide which of both alternatives for "Stmt" to choose, since two input continuations are possible. They can be discriminated by peeking at the second lookahead token (yellow background).

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence (hence LL, compared with LR parser that constructs a rightmost derivation). A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

Contents

LL parsers are table-based parsers, similar to LR parsers. LL grammars can alternatively be characterized as precisely those that can be parsed by a predictive parser – a recursive descent parser without backtracking – and these can be readily written by hand. This article is about the formal properties of LL grammars; for parsing, see LL parser or recursive descent parser.

Formal definition

Finite case

Given a natural number , a context-free grammar is an LL(k) grammar if

there is at most one production rule such that for some terminal symbol strings ,

An alternative, but equivalent, formal definition is the following: is an LL(k) grammar if, for arbitrary derivations

when the first symbols of agree with those of , then . [3] [4]

Informally, when a parser has derived , with its leftmost nonterminal and already consumed from the input, then by looking at that and peeking at the next symbols of the current input, the parser can identify with certainty the production rule for .

When rule identification is possible even without considering the past input , then the grammar is called a strong LL(k) grammar. [5] In the formal definition of a strong LL(k) grammar, the universal quantifier for is omitted, and is added to the "for some" quantifier for . For every LL(k) grammar, a structurally equivalent strong LL(k) grammar can be constructed. [6]

The class of LL(k) languages forms a strictly increasing sequence of sets: LL(0) ⊊ LL(1) ⊊ LL(2) ⊊ …. [7] It is decidable whether a given grammar G is LL(k), but it is not decidable whether an arbitrary grammar is LL(k) for some k. It is also decidable if a given LR(k) grammar is also an LL(m) grammar for some m. [8]

Every LL(k) grammar is also an LR(k) grammar. An ε -free LL(1) grammar is also an SLR(1) grammar. An LL(1) grammar with symbols that have both empty and non-empty derivations is also an LALR(1) grammar. An LL(1) grammar with symbols that have only the empty derivation may or may not be LALR(1). [9]

LL grammars cannot have rules containing left recursion. [10] Each LL(k) grammar that is ε-free can be transformed into an equivalent LL(k) grammar in Greibach normal form (which by definition does not have rules with left recursion). [11]

Regular case

Let be a terminal alphabet. A partition of is called a regular partition if for every the language is regular.

Let be a context free grammar and let be a regular partition of . We say that is an LL() grammar if, for arbitrary derivations

such that it follows that . [12]

A grammar G is said to be LL-regular (LLR) if there exists a regular partition of such that G is LL(). A language is LL-regular if it is generated by an LL-regular grammar.

LLR grammars are unambiguous and cannot be left-recursive.

Every LL(k) grammar is LLR. Every LL(k) grammar is deterministic, but there exists a LLR grammar that is not deterministic. [13] Hence the class of LLR grammars is strictly larger than the union of LL(k) for each k.

It is decidable whether, given a regular partition , a given grammar is LL(). It is, however, not decidable whether an arbitrary grammar G is LLR. This is due to the fact that deciding whether a grammar G generates a regular language, which would be necessary to find a regular partition for G, can be reduced to the Post correspondence problem.

Every LLR grammar is LR-regular (LRR, the corresponding[ clarify ] equivalent for LR(k) grammars), but there exists an LR(1) grammar that is not LLR. [13]

Historically, LLR grammars followed the invention of the LRR grammars. Given a regular partition a Moore machine can be constructed to transduce the parsing from right to left, identifying instances of regular productions. Once that has been done, an LL(1) parser is sufficient to handle the transduced input in linear time. Thus, LLR parsers can handle a class of grammars strictly larger than LL(k) parsers while being equally efficient. Despite that the theory of LLR does not have any major applications. One possible and very plausible reason is that while there are generative algorithms for LL(k) and LR(k) parsers, the problem of generating an LLR/LRR parser is undecidable unless one has constructed a regular partition upfront. But even the problem of constructing a suitable regular partition given grammar is undecidable.

Simple deterministic languages

A context-free grammar is called simple deterministic, [14] or just simple, [15] if

A set of strings is called a simple deterministic, or just simple, language, if it has a simple deterministic grammar.

The class of languages having an ε-free LL(1) grammar in Greibach normal form equals the class of simple deterministic languages. [16] This language class includes the regular sets not containing ε. [15] Equivalence is decidable for it, while inclusion is not. [14]

Applications

LL grammars, particularly LL(1) grammars, are of great practical interest, as they are easy to parse, either by LL parsers or by recursive descent parsers, and many computer languages [ clarify ] are designed to be LL(1) for this reason. Languages based on grammars with a high value of k have traditionally been considered [ citation needed ] to be difficult to parse, although this is less true now given the availability and widespread use [ citation needed ] of parser generators supporting LL(k) grammars for arbitrary k.

See also

Notes

  1. Kernighan & Ritchie 1988, Appendix A.13 "Grammar", p.193 ff. The top image part shows a simplified excerpt in an EBNF-like notation..
  2. Rosenkrantz & Stearns (1970 , p. 227). Def.1. The authors do not consider the case k=0.
  3. where "" denotes derivability by leftmost derivations, and , , and
  4. Waite & Goos (1984 , p. 123) Def. 5.22
  5. Rosenkrantz & Stearns (1970 , p. 235) Def.2
  6. Rosenkrantz & Stearns (1970 , p. 235) Theorem 2
  7. Rosenkrantz & Stearns (1970 , p. 246–247): Using "" to denote "or", the string set has an , but no ε-free grammar, for each .
  8. Rosenkrantz & Stearns (1970 , pp. 254–255)
  9. Beatty (1982)
  10. Rosenkrantz & Stearns (1970 , pp. 241) Lemma 5
  11. Rosenkrantz & Stearns (1970 , p. 242) Theorem 4
  12. Poplawski, David (1977). "Properties of LL-Regular Languages". Purdue University.{{cite journal}}: Cite journal requires |journal= (help)
  13. 1 2 David A. Poplawski (Aug 1977). Properties of LL-Regular Languages (Technical Report). Purdue University, Department of Computer Science.
  14. 1 2 Korenjak & Hopcroft (1966)
  15. 1 2 Hopcroft & Ullman (1979 , p. 229) Exercise 9.3
  16. Rosenkrantz & Stearns (1970 , p. 243)

Sources

Further reading

Related Research Articles

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

In formal language theory, a context-sensitive language is a language that can be defined by a context-sensitive grammar. Context-sensitive is one of the four types of grammars in the Chomsky hierarchy.

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

<span class="mw-page-title-main">Pushdown automaton</span> Type of automaton

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

<span class="mw-page-title-main">Automata theory</span> Study of abstract machines and automata

Automata theory is the study of abstract machines and automata, as well as the computational problems that can be solved using them. It is a theory in theoretical computer science with close connections to mathematical logic. The word automata comes from the Greek word αὐτόματος, which means "self-acting, self-willed, self-moving". An automaton is an abstract self-propelled computing device which follows a predetermined sequence of operations automatically. An automaton with a finite number of states is called a Finite Automaton (FA) or Finite-State Machine (FSM). The figure on the right illustrates a finite-state machine, which is a well-known type of automaton. This automaton consists of states and transitions. As the automaton sees a symbol of input, it makes a transition to another state, according to its transition function, which takes the previous state and current input symbol as its arguments.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

<span class="mw-page-title-main">Dyck language</span> Language consisting of balanced strings of brackets

In the theory of formal languages of computer science, mathematics, and linguistics, a Dyck word is a balanced string of brackets. The set of Dyck words forms a Dyck language. The simplest, D1, use just two matching brackets, e.g..

In theoretical computer science and mathematical logic a string rewriting system (SRS), historically called a semi-Thue system, is a rewriting system over strings from a alphabet. Given a binary relation between fixed strings over the alphabet, called rewrite rules, denoted by , an SRS extends the rewriting relation to all strings in which the left- and right-hand side of the rules appear as substrings, that is , where , , , and are strings.

Conjunctive grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with a conjunction operation. Besides explicit conjunction, conjunctive grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction can be used, in particular, to specify intersection of languages. A further extension of conjunctive grammars known as Boolean grammars additionally allows explicit negation.

In automata theory, a deterministic pushdown automaton is a variation of the pushdown automaton. The class of deterministic pushdown automata accepts the deterministic context-free languages, a proper subset of context-free languages.

<span class="mw-page-title-main">Terminal and nonterminal symbols</span> Categories of symbols in formal grammars

In formal languages, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined as part of a formal grammar. Nonterminal symbols are replaced by groups of terminal symbols according to the production rules.

<span class="mw-page-title-main">Yield surface</span>

A yield surface is a five-dimensional surface in the six-dimensional space of stresses. The yield surface is usually convex and the state of stress of inside the yield surface is elastic. When the stress state lies on the surface the material is said to have reached its yield point and the material is said to have become plastic. Further deformation of the material causes the stress state to remain on the yield surface, even though the shape and size of the surface may change as the plastic deformation evolves. This is because stress states that lie outside the yield surface are non-permissible in rate-independent plasticity, though not in some models of viscoplasticity.

A read-only Turing machine or two-way deterministic finite-state automaton (2DFA) is class of models of computability that behave like a standard Turing machine and can move in both directions across input, except cannot write to its input tape. The machine in its bare form is equivalent to a deterministic finite automaton in computational power, and therefore can only parse a regular language.

An embedded pushdown automaton or EPDA is a computational model for parsing languages generated by tree-adjoining grammars (TAGs). It is similar to the context-free grammar-parsing pushdown automaton, but instead of using a plain stack to store symbols, it has a stack of iterated stacks that store symbols, giving TAGs a generative capacity between context-free and context-sensitive grammars, or a subset of mildly context-sensitive grammars. Embedded pushdown automata should not be confused with nested stack automata which have more computational power.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes how to form strings from an alphabet of a formal language that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

In mathematics, a càdlàg, RCLL, or corlol function is a function defined on the real numbers that is everywhere right-continuous and has left limits everywhere. Càdlàg functions are important in the study of stochastic processes that admit jumps, unlike Brownian motion, which has continuous sample paths. The collection of càdlàg functions on a given domain is known as Skorokhod space.

In theoretical computer science, in particular in formal language theory, the Brzozowski derivative of a set of strings and a string is the set of all strings obtainable from a string in by cutting off the prefix . Formally:

The three-state Potts CFT, also known as the parafermion CFT, is a conformal field theory in two dimensions. It is a minimal model with central charge . It is considered to be the simplest minimal model with a non-diagonal partition function in Virasoro characters, as well as the simplest non-trivial CFT with the W-algebra as a symmetry.