Ambiguous grammar

Last updated

In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree. [1] Every non-empty context-free language admits an ambiguous grammar by introducing e.g. a duplicate rule. A language that only admits ambiguous grammars is called an inherently ambiguous language. Deterministic context-free grammars are always unambiguous, and are an important subclass of unambiguous grammars; there are non-deterministic unambiguous grammars, however.

Contents

For computer programming languages, the reference grammar is often ambiguous, due to issues such as the dangling else problem. If present, these ambiguities are generally resolved by adding precedence rules or other context-sensitive parsing rules, so the overall phrase grammar is unambiguous.[ citation needed ] Some parsing algorithms (such as (Earley [2] or GLR parsers) can generate sets of parse trees (or "parse forests") from strings that are syntactically ambiguous. [3]

Examples

Trivial language

The simplest example is the following ambiguous grammar (with start symbol A) for the trivial language that consists of only the empty string:

A → A | ε

…meaning that the nonterminal A can be derived to either itself again, or to the empty string. Thus the empty string has leftmost derivations of length 1, 2, 3, and indeed of any length, depending on how many times the rule A → A is used.

This language also has an unambiguous grammar, consisting of a single production rule:

A → ε

…meaning that the unique production can produce only the empty string, which is the unique string in the language.

In the same way, any grammar for a non-empty language can be made ambiguous by adding duplicates.

Unary string

The regular language of unary strings of a given character, say 'a' (the regular expression a*), has the unambiguous grammar:

A → aA | ε

…but also has the ambiguous grammar:

A → aA | Aa | ε

These correspond to producing a right-associative tree (for the unambiguous grammar) or allowing both left- and right- association. This is elaborated below.

Addition and subtraction

The context free grammar

A → A + A | A − A | a

is ambiguous since there are two leftmost derivations for the string a + a + a:

    A→ A + A    A→ A + A
    → a + A    → A + A + A (First A is replaced by A+A. Replacement of the second A would yield a similar derivation)
    → a + A + A    → a + A + A
    → a + a + A    → a + a + A
    → a + a + a    → a + a + a

As another example, the grammar is ambiguous since there are two parse trees for the string a + a a:

Leftmostderivations jaredwf.svg

The language that it generates, however, is not inherently ambiguous; the following is a non-ambiguous grammar generating the same language:

A → A + a | A − a | a

Dangling else

A common example of ambiguity in computer programming languages is the dangling else problem. In many languages, the else in an If–then(–else) statement is optional, which results in nested conditionals having multiple ways of being recognized in terms of the context-free grammar.

Concretely, in many languages one may write conditionals in two valid forms: the if-then form, and the if-then-else form – in effect, making the else clause optional: [note 1]

In a grammar containing the rules

Statement → if Condition then Statement |             if Condition then Statement else Statement |             ... Condition → ...

some ambiguous phrase structures can appear. The expression

if a thenif b then s else s2

can be parsed as either

if a thenbeginif b then s endelse s2

or as

if a thenbeginif b then s else s2 end

depending on whether the else is associated with the first if or second if.

This is resolved in various ways in different languages. Sometimes the grammar is modified so that it is unambiguous, such as by requiring an endif statement or making else mandatory. In other cases the grammar is left ambiguous, but the ambiguity is resolved by making the overall phrase grammar context-sensitive, such as by associating an else with the nearest if. In this latter case the grammar is unambiguous, but the context-free grammar is ambiguous.[ clarification needed ]

An unambiguous grammar with multiple derivations

The existence of multiple derivations of the same string does not suffice to indicate that the grammar is ambiguous; only multiple leftmost derivations (or, equivalently, multiple parse trees) indicate ambiguity.

For example, the simple grammar

S → A + A A → 0 | 1

is an unambiguous grammar for the language { 0+0, 0+1, 1+0, 1+1 }. While each of these four strings has only one leftmost derivation, it has two different derivations, for example

S  A + A ⇒ 0 + A ⇒ 0 + 0

and

S ⇒ A + A ⇒ A + 0 ⇒ 0 + 0

Only the former derivation is a leftmost one.

Recognizing ambiguous grammars

The decision problem of whether an arbitrary grammar is ambiguous is undecidable because it can be shown that it is equivalent to the Post correspondence problem. [4] At least, there are tools implementing some semi-decision procedure for detecting ambiguity of context-free grammars. [5]

The efficiency of parsing a context-free grammar is determined by the automaton that accepts it. Deterministic context-free grammars are accepted by deterministic pushdown automata and can be parsed in linear time, for example by an LR parser. [6] They are a strict subset of the context-free grammars, which are accepted by pushdown automata and can be parsed in polynomial time, for example by the CYK algorithm.

Unambiguous context-free grammars can be nondeterministic. For example, the language of even-length palindromes on the alphabet of 0 and 1 has the unambiguous context-free grammar S → 0S0 | 1S1 | ε. An arbitrary string of this language cannot be parsed without reading all its symbols first, which means that a pushdown automaton has to try alternative state transitions to accommodate for the different possible lengths of a semi-parsed string. [7]

Nevertheless, removing grammar ambiguity may produce a deterministic context-free grammar and thus allow for more efficient parsing. Compiler generators such as YACC include features for resolving some kinds of ambiguity, such as by using the precedence and associativity constraints.

Inherently ambiguous languages

While some context-free languages (the set of strings that can be generated by a grammar) have both ambiguous and unambiguous grammars, there exist context-free languages for which no unambiguous context-free grammar can exist. Such languages are called inherently ambiguous.

There are no inherently ambiguous regular languages. [8] [9]

The existence of inherently ambiguous context-free languages was proven with Parikh's theorem in 1961 by Rohit Parikh in an MIT research report. [10]

The language is inherently ambiguous. [11]

Ogden's lemma [12] can be used to prove that certain context-free languages, such as , are inherently ambiguous. See this page for a proof.

The union of with is inherently ambiguous. This set is context-free, since the union of two context-free languages is always context-free. But Hopcroft & Ullman (1979) give a proof that any context-free grammar for this union language cannot unambiguously parse strings of form . [13]

More examples, and a general review of techniques for proving inherent ambiguity of context-free languages, are found given by Bassino and Nicaud (2011). [14]

See also

Related Research Articles

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

In formal language theory, a context-sensitive language is a language that can be defined by a context-sensitive grammar. Context-sensitive is one of the four types of grammars in the Chomsky hierarchy.

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG).

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form if all of its production rules are of the form:

In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though it may suffer problems with certain nullable grammars. The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation in 1968.

<span class="mw-page-title-main">Pushdown automaton</span> Type of automaton

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

In theoretical computer science and formal language theory, a regular grammar is a grammar that is right-regular or left-regular. While their exact definition varies from textbook to textbook, they all require that

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In computer science, a linear grammar is a context-free grammar that has at most one nonterminal in the right-hand side of each of its productions.

In automata theory, a deterministic pushdown automaton is a variation of the pushdown automaton. The class of deterministic pushdown automata accepts the deterministic context-free languages, a proper subset of context-free languages.

In formal language theory, deterministic context-free languages (DCFL) are a proper subset of context-free languages. They are the context-free languages that can be accepted by a deterministic pushdown automaton. DCFLs are always unambiguous, meaning that they admit an unambiguous grammar. There are non-deterministic unambiguous CFLs, so DCFLs form a proper subset of unambiguous CFLs.

In formal grammar theory, the deterministic context-free grammars (DCFGs) are a proper subset of the context-free grammars. They are the subset of context-free grammars that can be derived from deterministic pushdown automata, and they generate the deterministic context-free languages. DCFGs are always unambiguous, and are an important subclass of unambiguous CFGs; there are non-deterministic unambiguous CFGs, however.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

<span class="mw-page-title-main">LL grammar</span> Type of a context-free grammar

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

In theoretical computer science, in particular in formal language theory, Greibach's theorem states that certain properties of formal language classes are undecidable. It is named after the computer scientist Sheila Greibach, who first proved it in 1963.

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

References

  1. Willem J. M. Levelt (2008). An Introduction to the Theory of Formal Languages and Automata. John Benjamins Publishing. ISBN   978-90-272-3250-2.
  2. Scott, Elizabeth (April 1, 2008). "SPPF-Style Parsing From Earley Recognizers". Electronic Notes in Theoretical Computer Science. 203 (2): 53–67. doi: 10.1016/j.entcs.2008.03.044 .
  3. Tomita, Masaru. "An efficient augmented-context-free parsing algorithm." Computational linguistics 13.1-2 (1987): 31-46.
  4. Hopcroft, John; Motwani, Rajeev; Ullman, Jeffrey (2001). Introduction to automata theory, languages, and computation (2nd ed.). Addison-Wesley. Theorem 9.20, pp. 405–406. ISBN   0-201-44124-1.
  5. Axelsson, Roland; Heljanko, Keijo; Lange, Martin (2008). "Analyzing Context-Free Grammars Using an Incremental SAT Solver" (PDF). Proceedings of the 35th International Colloquium on Automata, Languages and Programming (ICALP'08), Reykjavik, Iceland. Lecture Notes in Computer Science. Vol. 5126. Springer-Verlag. pp. 410–422. doi:10.1007/978-3-540-70583-3_34. ISBN   978-3-540-70582-6.
  6. Knuth, D. E. (July 1965). "On the translation of languages from left to right". Information and Control. 8 (6): 607–639. doi:10.1016/S0019-9958(65)90426-2.
  7. Hopcroft, John; Motwani, Rajeev; Ullman, Jeffrey (2001). Introduction to automata theory, languages, and computation (2nd ed.). Addison-Wesley. pp. 249–253. ISBN   0-201-44124-1.
  8. Book, R.; Even, S.; Greibach, S.; Ott, G. (Feb 1971). "Ambiguity in Graphs and Expressions". IEEE Transactions on Computers. C-20 (2): 149–153. doi:10.1109/t-c.1971.223204. ISSN   0018-9340. S2CID   20676251.
  9. "formal languages - Can regular expressions be made unambiguous?". MathOverflow. Retrieved 2023-02-23.
  10. Parikh, Rohit (January 1961). Language-generating devices. Quarterly Progress Report, Research Laboratory of Electronics, MIT.
  11. Parikh, Rohit J. (1966-10-01). "On Context-Free Languages". Journal of the ACM. 13 (4): 570–581. doi: 10.1145/321356.321364 . ISSN   0004-5411. S2CID   12263468. Here: Theorem 3.
  12. Ogden, William (Sep 1968). "A helpful result for proving inherent ambiguity". Mathematical Systems Theory. 2 (3): 191–194. doi:10.1007/bf01694004. ISSN   0025-5661. S2CID   13197551.
  13. p.99-103, Sect.4.7
  14. Fredérique Bassino and Cyril Nicaud (December 16, 2011). "Philippe Flajolet & Analytic Combinatorics: Inherent Ambiguity of Context-Free Languages" (PDF). Archived (PDF) from the original on 2022-09-25.

Notes

  1. The following example uses Pascal syntax