Context-sensitive grammar

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general (in the same sense) than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy. [1]

A formal language that can be described by a context-sensitive grammar, or, equivalently, by a noncontracting grammar or a linear bounded automaton, is called a context-sensitive language. Some textbooks actually define CSGs as non-contracting, [2] [3] [4] [5] although this is not how Noam Chomsky defined them in 1959. [6] [7] This choice of definition makes no difference in terms of the languages generated (i.e. the two definitions are weakly equivalent), but it does make a difference in terms of what grammars are structurally considered context-sensitive; the latter issue was analyzed by Chomsky in 1963. [8] [9]

Chomsky introduced context-sensitive grammars as a way to describe the syntax of natural language where it is often the case that a word may or may not be appropriate in a certain place depending on the context. Walter Savitch has criticized the terminology "context-sensitive" as misleading and proposed "non-erasing" as better explaining the distinction between a CSG and an unrestricted grammar. [10]

Although it is well known that certain features of languages (e.g. cross-serial dependency) are not context-free, it is an open question how much of CSGs' expressive power is needed to capture the context sensitivity found in natural languages. Subsequent research in this area has focused on the more computationally tractable mildly context-sensitive languages.[ citation needed ] The syntaxes of some visual programming languages can be described by context-sensitive graph grammars. [11]

Formal definition

Formal grammar

Let us notate a formal grammar as G = (N, Σ, P, S), with N a set of nonterminal symbols, Σ a set of terminal symbols, P a set of production rules, and S ∈ N the start symbol.

A string u ∈ (N ∪ Σ)* directly yields, or directly derives to, a string v ∈ (N ∪ Σ)*, denoted as u ⇒ v, if v can be obtained from u by an application of some production rule in P, that is, if u = γLδ and v = γRδ, where (L → R) ∈ P is a production rule, and γ, δ ∈ (N ∪ Σ)* are the unaffected left and right parts of the string, respectively. More generally, u is said to yield, or derive to, v, denoted as u ⇒* v, if v can be obtained from u by repeated application of production rules, that is, if u = u₀ ⇒ u₁ ⇒ … ⇒ uₙ = v for some n ≥ 0 and some strings u₁, …, uₙ₋₁ ∈ (N ∪ Σ)*. In other words, the relation ⇒* is the reflexive transitive closure of the relation ⇒.

The language of the grammar G is the set of all terminal-symbol strings derivable from its start symbol, formally: L(G) = { w ∈ Σ* : S ⇒* w }. Derivations that do not end in a string composed of terminal symbols only are possible, but do not contribute to L(G).
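The one-step "directly yields" relation can be made concrete with a short sketch. The following Python function is our own illustration (the representation of rules as (left, right) string pairs, and the function name, are assumptions, not from the text); it applies every production rule at every position of the input string:

```python
def step(u, rules):
    """Return all strings v with u => v, where rules is a list of
    (lhs, rhs) pairs and each application rewrites one occurrence
    of lhs inside u."""
    results = set()
    for lhs, rhs in rules:
        start = 0
        while (i := u.find(lhs, start)) != -1:
            # u = gamma + lhs + delta  yields  gamma + rhs + delta
            results.add(u[:i] + rhs + u[i + len(lhs):])
            start = i + 1
    return results

# Example with the context-free rules S -> aSb and S -> ab
# (uppercase letters play the role of nonterminals):
rules = [("S", "aSb"), ("S", "ab")]
print(step("S", rules))    # {'aSb', 'ab'} (set; print order may vary)
print(step("aSb", rules))  # {'aaSbb', 'aabb'} (set; print order may vary)
```

The relation ⇒* then corresponds to iterating step until a desired string is reached.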

Context-sensitive grammar

A formal grammar is context-sensitive if each rule in P is either of the form S → ε, where ε denotes the empty string, or of the form

αAβ → αγβ

with A ∈ N, [note 1] α, β ∈ (N ∪ Σ)*, [note 2] and γ ∈ (N ∪ Σ)⁺. [note 3]

The name context-sensitive is explained by the α and β that form the context of A and determine whether A can be replaced with γ or not. By contrast, in a context-free grammar, no context is present: the left hand side of every production rule is just a nonterminal.

The string γ is not allowed to be empty. Without this restriction, the resulting grammars become equal in power to unrestricted grammars. [10]

(Weakly) equivalent definitions

A noncontracting grammar is a grammar in which, for any production rule of the form u → v, the length of u is less than or equal to the length of v: |u| ≤ |v|.

Every context-sensitive grammar is noncontracting, while every noncontracting grammar can be converted into an equivalent context-sensitive grammar; the two classes are weakly equivalent. [12]

Some authors use the term context-sensitive grammar to refer to noncontracting grammars in general.
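The noncontracting condition is easy to test mechanically. A minimal sketch (our own illustration; the rule representation as (left, right) string pairs is an assumption):

```python
def is_noncontracting(rules):
    """True iff every rule u -> v satisfies |u| <= |v|."""
    return all(len(lhs) <= len(rhs) for lhs, rhs in rules)

print(is_noncontracting([("S", "aBC"), ("CB", "CZ"), ("aB", "ab")]))  # True
print(is_noncontracting([("AE", "")]))  # False: an erasing rule shrinks the string
```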

The left-context- and right-context-sensitive grammars are defined by restricting the rules to just the form αA → αγ and to just Aβ → γβ, respectively. The languages generated by these grammars are also the full class of context-sensitive languages. [13] The equivalence was established via the Penttonen normal form. [14]

Examples

aⁿbⁿcⁿ

The following context-sensitive grammar, with start symbol S, generates the canonical non-context-free language { aⁿbⁿcⁿ | n ≥ 1 }:[ citation needed ]

1. S → aBC
2. S → aSBC
3. CB → CZ
4. CZ → WZ
5. WZ → WC
6. WC → BC
7. aB → ab
8. bB → bb
9. bC → bc
10. cC → cc

Rules 1 and 2 allow for blowing up S to aⁿBC(BC)ⁿ⁻¹; rules 3 to 6 allow for successively exchanging each CB to BC (four rules are needed for that since a rule CB → BC would not fit into the scheme αAβ → αγβ); rules 7–10 allow replacing a nonterminal B or C with its corresponding terminal b or c, respectively, provided it is in the right place. A derivation of aaabbbccc is:

S
⇒₂ aSBC
⇒₂ aaSBCBC
⇒₁ aaaBCBCBC
⇒₃ aaaBCZCBC
⇒₄ aaaBWZCBC
⇒₅ aaaBWCCBC
⇒₆ aaaBBCCBC
⇒₃ aaaBBCCZC
⇒₄ aaaBBCWZC
⇒₅ aaaBBCWCC
⇒₆ aaaBBCBCC
⇒₃ aaaBBCZCC
⇒₄ aaaBBWZCC
⇒₅ aaaBBWCCC
⇒₆ aaaBBBCCC
⇒₇ aaabBBCCC
⇒₈ aaabbBCCC
⇒₈ aaabbbCCC
⇒₉ aaabbbcCC
⇒₁₀ aaabbbccC
⇒₁₀ aaabbbccc
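The derivation above can be replayed mechanically. In this sketch (our own illustration; it applies each rule at the leftmost position where its left-hand side occurs, which happens to reproduce the chain shown):

```python
RULES = {1: ("S", "aBC"), 2: ("S", "aSBC"), 3: ("CB", "CZ"),
         4: ("CZ", "WZ"), 5: ("WZ", "WC"), 6: ("WC", "BC"),
         7: ("aB", "ab"), 8: ("bB", "bb"), 9: ("bC", "bc"),
         10: ("cC", "cc")}

def apply_rule(s, k):
    """Apply rule number k at the leftmost occurrence of its left-hand side."""
    lhs, rhs = RULES[k]
    i = s.index(lhs)
    return s[:i] + rhs + s[i + len(lhs):]

s = "S"
for k in (2, 2, 1, 3, 4, 5, 6, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8, 8, 9, 10, 10):
    s = apply_rule(s, k)
print(s)  # aaabbbccc
```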

aⁿbⁿcⁿdⁿ, etc.

More complicated grammars can be used to parse { aⁿbⁿcⁿdⁿ | n ≥ 1 }, and other languages with even more letters. Here we show a simpler approach using noncontracting grammars: start with a kernel of regular productions generating the desired sentential forms, and then include noncontracting productions that successively rearrange the nonterminals into the target order.[ citation needed ]

aᵐbⁿcᵐdⁿ

A noncontracting grammar (for which there is an equivalent CSG) for the language { aᵐbⁿcᵐdⁿ | m ≥ 1, n ≥ 1 } is defined by

S → aSC,
S → aUC,
U → bUD,
U → bD,
DC → CD,
bC → bc,
cC → cc,
cD → cd, and
dD → dd.

With these definitions, a derivation for abcd is: S ⇒ aUC ⇒ abDC ⇒ abCD ⇒ abcD ⇒ abcd.[ citation needed ]

a^(2^i)

A noncontracting grammar for the language { a^(2^i) | i ≥ 1 } is constructed in Example 9.5 (p. 224) of (Hopcroft, Ullman, 1979): [15]

Kuroda normal form

Every context-sensitive grammar which does not generate the empty string can be transformed into a weakly equivalent one in Kuroda normal form. "Weakly equivalent" here means that the two grammars generate the same language. The normal form will not in general be context-sensitive, but will be a noncontracting grammar. [16] [17]

The Kuroda normal form is an actual normal form for non-contracting grammars.
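In one common formulation, a grammar is in Kuroda normal form if every rule has one of the shapes AB → CD, A → BC, A → B, or A → a, where uppercase letters are nonterminals and a is a terminal. A small predicate sketch under that convention (our own illustration, not from the cited sources):

```python
def in_kuroda_nf(rules):
    """True iff every (lhs, rhs) rule is AB -> CD, A -> BC, A -> B, or A -> a,
    using the convention uppercase = nonterminal, lowercase = terminal."""
    def nts(s):  # s consists only of nonterminal (uppercase) symbols
        return s.isupper()
    for lhs, rhs in rules:
        if len(lhs) == 2 and nts(lhs) and len(rhs) == 2 and nts(rhs):
            continue  # AB -> CD
        if len(lhs) == 1 and nts(lhs):
            if len(rhs) == 2 and nts(rhs):
                continue  # A -> BC
            if len(rhs) == 1 and (rhs.isupper() or rhs.islower()):
                continue  # A -> B or A -> a
        return False
    return True

print(in_kuroda_nf([("AB", "CD"), ("S", "AB"), ("A", "a")]))  # True
print(in_kuroda_nf([("CB", "CZZ")]))                          # False
```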

Properties and uses

Equivalence to linear bounded automaton

A formal language can be described by a context-sensitive grammar if and only if it is accepted by some linear bounded automaton (LBA). [18] In some textbooks this result is attributed solely to Landweber and Kuroda. [7] Others call it the Myhill–Landweber–Kuroda theorem. [19] (Myhill introduced the concept of deterministic LBA in 1960. Peter S. Landweber published in 1963 that the language accepted by a deterministic LBA is context sensitive. [20] Kuroda introduced the notion of non-deterministic LBA and the equivalence between LBA and CSGs in 1964. [21] [22] )

As of 2010[ needs update ] it is still an open question whether every context-sensitive language can be accepted by a deterministic LBA. [23]

Closure properties

Context-sensitive languages are closed under complement. This 1988 result is known as the Immerman–Szelepcsényi theorem. [19] Moreover, they are closed under union, intersection, concatenation, substitution, [note 4] inverse homomorphism, and Kleene plus. [24]

Every recursively enumerable language L can be written as h(L′) for some context-sensitive language L′ and some string homomorphism h. [25]

Computational problems

The decision problem that asks whether a certain string s belongs to the language of a given context-sensitive grammar G, is PSPACE-complete. Moreover, there are context-sensitive grammars whose languages are PSPACE-complete. In other words, there is a context-sensitive grammar G such that deciding whether a certain string s belongs to the language of G is PSPACE-complete (so G is fixed and only s is part of the input of the problem). [26]
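For a noncontracting grammar, no sentential form in a derivation of s can be longer than s, so membership can be decided by exhaustive search over the finitely many bounded-length forms. The sketch below (our own illustration; exponential-time, in line with the PSPACE-hardness just discussed, and usable only for tiny inputs) does exactly that for the aⁿbⁿcⁿ grammar from the Examples section:

```python
def derives(s, rules, start="S"):
    """Decide whether the noncontracting grammar derives s, by searching
    all sentential forms of length at most len(s)."""
    seen, frontier = {start}, [start]
    while frontier:
        u = frontier.pop()
        if u == s:
            return True
        for lhs, rhs in rules:
            i = 0
            while (i := u.find(lhs, i)) != -1:
                v = u[:i] + rhs + u[i + len(lhs):]
                # Pruning by length is sound because rules never shrink strings.
                if len(v) <= len(s) and v not in seen:
                    seen.add(v)
                    frontier.append(v)
                i += 1
    return False

CSG_RULES = [("S", "aBC"), ("S", "aSBC"), ("CB", "CZ"), ("CZ", "WZ"),
             ("WZ", "WC"), ("WC", "BC"), ("aB", "ab"), ("bB", "bb"),
             ("bC", "bc"), ("cC", "cc")]
print(derives("aabbcc", CSG_RULES))  # True
print(derives("aabbc", CSG_RULES))   # False
```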

The emptiness problem for context-sensitive grammars (given a context-sensitive grammar G, is L(G)=∅ ?) is undecidable. [27] [note 5]

As model of natural languages

Savitch has proven the following theoretical result, on which he bases his criticism of CSGs as a basis for natural language: for any recursively enumerable set R, there exists a context-sensitive language/grammar G which can be used as a sort of proxy to test membership in R in the following way: given a string s, s is in R if and only if there exists a positive integer n for which scⁿ is in G, where c is an arbitrary symbol not part of R's alphabet. [10]

It has been shown that nearly all natural languages may in general be characterized by context-sensitive grammars, but the whole class of CSGs seems to be much bigger than natural languages.[ citation needed ] Worse yet, since the membership problem for CSGs is PSPACE-complete, they are unworkable for practical use: a polynomial-time algorithm for a PSPACE-complete problem would imply P = NP.

It was proven that some natural languages are not context-free, based on identifying so-called cross-serial dependencies and unbounded scrambling phenomena.[ citation needed ] However, this does not necessarily imply that the full class of CSGs is necessary to capture "context sensitivity" in the colloquial sense of these terms in natural languages. For example, linear context-free rewriting systems (LCFRSs) are strictly weaker than CSGs but can account for the phenomenon of cross-serial dependencies; one can write an LCFRS grammar for { aⁿbⁿcⁿdⁿ | n ≥ 1 }, for example. [28] [29] [30]

Ongoing research in computational linguistics has focused on formulating other classes of languages that are "mildly context-sensitive", whose decision problems are feasible; examples include tree-adjoining grammars, combinatory categorial grammars, coupled context-free languages, and linear context-free rewriting systems. The languages generated by these formalisms lie properly between the context-free and context-sensitive languages.

More recently, the class PTIME has been identified with range concatenation grammars, which are now considered the most expressive of the mildly context-sensitive language classes. [30]

Notes

  1. i.e., A is a single nonterminal
  2. i.e., α and β are strings of nonterminals (except for the start symbol) and terminals
  3. i.e., γ is a nonempty string of nonterminals (except for the start symbol) and terminals
  4. more formally: if L ⊆ Σ* is a context-sensitive language and f maps each a ∈ Σ to a context-sensitive language f(a), then f(L) is again a context-sensitive language
  5. This also follows from (1) context-free languages being also context-sensitive, (2) context-sensitive languages being closed under intersection, and (3) disjointness of context-free languages being undecidable.


References

  1. (Hopcroft, Ullman, 1979); Sect.9.4, p.227
  2. Linz, Peter (2011). An Introduction to Formal Languages and Automata. Jones & Bartlett Publishers. p. 291. ISBN   978-1-4496-1552-9.
  3. Meduna, Alexander (2000). Automata and Languages: Theory and Applications. Springer Science & Business Media. p. 730. ISBN   978-1-85233-074-3.
  4. Davis, Martin; Sigal, Ron; Weyuker, Elaine J. (1994). Computability, Complexity, and Languages: Fundamentals of Theoretical Computer Science (2nd ed.). Morgan Kaufmann. p. 189. ISBN   978-0-08-050246-5.
  5. Martin, John C. (2010). Introduction to Languages and the Theory of Computation (4th ed.). New York, NY: McGraw-Hill. p. 277. ISBN   9780073191461.
  6. Levelt, Willem J. M. (2008). An Introduction to the Theory of Formal Languages and Automata. John Benjamins Publishing. p. 26. ISBN   978-90-272-3250-2.
  7. Davis, Martin; Sigal, Ron; Weyuker, Elaine J. (1994). Computability, Complexity, and Languages: Fundamentals of Theoretical Computer Science (2nd ed.). Morgan Kaufmann. pp. 330–331. ISBN   978-0-08-050246-5.
  8. Chomsky, N. (1963). "Formal properties of grammar". In Luce, R. D.; Bush, R. R.; Galanter, E. (eds.). Handbook of Mathematical Psychology. New York: Wiley. pp. 360–363.
  9. Levelt, Willem J. M. (2008). An Introduction to the Theory of Formal Languages and Automata. John Benjamins Publishing. pp. 125–126. ISBN   978-90-272-3250-2.
  10. Carlos Martín Vide, ed. (1999). Issues in Mathematical Linguistics: Workshop on Mathematical Linguistics, State College, Pa., April 1998. John Benjamins Publishing. pp. 186–187. ISBN   90-272-1556-1.
  11. Zhang, Da-Qian, Kang Zhang, and Jiannong Cao. "A context-sensitive graph grammar formalism for the specification of visual languages." The Computer Journal 44.3 (2001): 186–200.
  12. Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages, and Computation . Addison-Wesley. ISBN   9780201029888.; p. 223–224; Exercise 9, p. 230. In the 2003 edition, the chapter on CSGs has been omitted.
  13. Hazewinkel, Michiel (1989). Encyclopaedia of Mathematics. Vol. 4. Springer Science & Business Media. p. 297. ISBN   978-1-55608-003-6. also at https://www.encyclopediaofmath.org/index.php/Grammar,_context-sensitive
  14. Ito, Masami; Kobayashi, Yūji; Shoji, Kunitaka (2010). Automata, Formal Languages and Algebraic Systems: Proceedings of AFLAS 2008, Kyoto, Japan, 20–22 September 2008. World Scientific. p. 183. ISBN   978-981-4317-60-3. citing Penttonen, Martti (Aug 1974). "One-sided and two-sided context in formal grammars". Information and Control . 25 (4): 371–392. doi: 10.1016/S0019-9958(74)91049-3 .
  15. They obtained the grammar by systematic transformation of an unrestricted grammar, given in Exm. 9.4, viz.:
    1. S → ACaB,
    2. Ca → aaC,
    3. CB → DB,
    4. CB → E,
    5. aD → Da,
    6. AD → AC,
    7. aE → Ea,
    8. AE → ε.
    In the context-sensitive grammar, a string enclosed in square brackets is considered a single symbol (similar to e.g. <name-part> in Backus–Naur form). The symbol names are chosen to resemble the unrestricted grammar. Likewise, rule groups in the context-sensitive grammar are numbered by the unrestricted-grammar rule they originated from.
  16. Kuroda, Sige-Yuki (June 1964). "Classes of languages and linear-bounded automata". Information and Control. 7 (2): 207–223. doi: 10.1016/s0019-9958(64)90120-2 .
  17. Mateescu, Alexandru; Salomaa, Arto (1997). "Chapter 4: Aspects of Classical Language Theory". In Rozenberg, Grzegorz; Salomaa, Arto (eds.). Handbook of Formal Languages. Volume I: Word, language, grammar. Springer-Verlag. pp. 175–252. ISBN   3-540-61486-9., Here: Theorem 2.2, p. 190
  18. (Hopcroft, Ullman, 1979); Theorem 9.5, 9.6, p. 225–226
  19. Sutner, Klaus (Spring 2016). "Context Sensitive Grammars" (PDF). Carnegie Mellon University. Archived from the original (PDF) on 2017-02-03. Retrieved 2019-08-29.
  20. P.S. Landweber (1963). "Three Theorems on Phrase Structure Grammars of Type 1". Information and Control . 6 (2): 131–136. doi:10.1016/s0019-9958(63)90169-4.
  21. Meduna, Alexander (2000). Automata and Languages: Theory and Applications. Springer Science & Business Media. p. 755. ISBN   978-1-85233-074-3.
  22. Levelt, Willem J. M. (2008). An Introduction to the Theory of Formal Languages and Automata. John Benjamins Publishing. pp. 126–127. ISBN   978-90-272-3250-2.
  23. Martin, John C. (2010). Introduction to Languages and the Theory of Computation (4th ed.). New York, NY: McGraw-Hill. p. 283. ISBN   9780073191461.
  24. (Hopcroft, Ullman, 1979); Exercise S9.10, p. 230–231
  25. (Hopcroft, Ullman, 1979); Exercise S9.14, p. 230–232. h maps each symbol to itself or to the empty string.
  26. An example of such a grammar, designed to solve the QSAT problem, is given in Lita, C. V. (2016-09-01). "On Complexity of the Detection Problem for Bounded Length Polymorphic Viruses". 2016 18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). pp. 371–378. doi:10.1109/SYNASC.2016.064. ISBN   978-1-5090-5707-8. S2CID   18067130.
  27. (Hopcroft, Ullman, 1979); Exercise S9.13, p. 230–231
  28. Kallmeyer, Laura (2011). "Mildly Context-Sensitive Grammar Formalisms: Natural Languages are not Context-Free" (PDF). Archived (PDF) from the original on 2014-08-19.
  29. Kallmeyer, Laura (2011). "Mildly Context-Sensitive Grammar Formalisms: Linear Context-Free Rewriting Systems" (PDF). Archived (PDF) from the original on 2014-08-19.
  30. Kallmeyer, Laura (2010). Parsing Beyond Context-Free Grammars. Springer Science & Business Media. pp. 1–5. ISBN   978-3-642-14846-0.
