Indexed grammar

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

Definition

Modern definition by Hopcroft and Ullman

In contemporary publications following Hopcroft and Ullman (1979), [2] an indexed grammar is formally defined as a 5-tuple G = ⟨N,T,F,P,S⟩ where

  1. N is a finite set of variables or nonterminal symbols,
  2. T is a finite set ("alphabet") of terminal symbols,
  3. F is a finite set of so-called index symbols, or indices,
  4. S ∈ N is the start symbol, and
  5. P is a finite set of productions of the forms given below.

In productions as well as in derivations of indexed grammars, a string ("stack") σ ∈ F* of index symbols is attached to every nonterminal symbol A ∈ N, denoted by A[σ]. [note 1] Terminal symbols may not be followed by index stacks. For an index stack σ ∈ F* and a string α ∈ (N ∪ T)* of nonterminal and terminal symbols, α[σ] denotes the result of attaching [σ] to every nonterminal in α; for example, if α equals aBCdE with a,d ∈ T terminal and B,C,E ∈ N nonterminal symbols, then α[σ] denotes aB[σ] C[σ] dE[σ]. Using this notation, each production in P has to be of the form

  1. A[σ] → α[σ],
  2. A[σ] → B[fσ], or
  3. A[fσ] → α[σ],

where A, B ∈ N are nonterminal symbols, f ∈ F is an index, σ ∈ F* is a string of index symbols, and α ∈ (N ∪ T)* is a string of nonterminal and terminal symbols. Some authors write ".." instead of "σ" for the index stack in production rules; the rules of types 1, 2, and 3 then read A[..]→α[..], A[..]→B[f..], and A[f..]→α[..], respectively.

Derivations are similar to those in a context-free grammar except for the index stack attached to each nonterminal symbol. When a production such as A[σ] → B[σ] C[σ] is applied, the index stack of A is copied to both B and C. Moreover, a rule can push an index symbol onto the stack, or pop its "topmost" (i.e., leftmost) index symbol.

Formally, the relation ⇒ ("direct derivation") is defined on the set (N[F*]∪T)* of "sentential forms" as follows:

  1. If A[σ] → α[σ] is a production of type 1, then β A[φ] γ ⇒ β α[φ] γ, using the above definition. That is, the rule's left-hand side's index stack φ is copied to each nonterminal of the right-hand side.
  2. If A[σ] → B[fσ] is a production of type 2, then β A[φ] γ ⇒ β B[fφ] γ. That is, the right-hand side's index stack is obtained from the left-hand side's stack φ by pushing f onto it.
  3. If A[fσ] → α[σ] is a production of type 3, then β A[fφ] γ ⇒ β α[φ] γ, using again the definition of α[σ]. That is, the first index f is popped from the left-hand side's stack, which is then distributed to each nonterminal of the right-hand side.

As usual, the derivation relation ⇒* is defined as the reflexive transitive closure of the direct derivation relation ⇒. The language L(G) = { w ∈ T* : S ⇒* w } is the set of all strings of terminal symbols derivable from the start symbol.
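
To make these three step types concrete, here is a minimal Python sketch (an illustration, not part of the cited formalism; the list encoding of sentential forms and all names are assumptions of this sketch) of the direct-derivation relation for single-character index symbols:

    # A sentential form is a list whose items are terminal characters or
    # (nonterminal, stack) pairs; the stack is a string, topmost index first.
    # Rules: ("copy", A, alpha)    encodes A[s]  -> alpha[s]
    #        ("push", A, f, B)     encodes A[s]  -> B[fs]
    #        ("pop",  A, f, alpha) encodes A[fs] -> alpha[s]
    def attach(alpha, stack, N):
        """alpha[stack]: attach a copy of `stack` to every nonterminal of alpha."""
        return [(x, stack) if x in N else x for x in alpha]

    def steps(form, rules, N):
        """All sentential forms reachable from `form` in one derivation step."""
        for i, item in enumerate(form):
            if not isinstance(item, tuple):
                continue                                  # terminals are not rewritten
            a, st = item
            for r in rules:
                if r[0] == "copy" and r[1] == a:          # type 1: stack copied to all
                    yield form[:i] + attach(r[2], st, N) + form[i+1:]
                elif r[0] == "push" and r[1] == a:        # type 2: push f
                    yield form[:i] + [(r[3], r[2] + st)] + form[i+1:]
                elif r[0] == "pop" and r[1] == a and st[:1] == r[2]:
                    yield form[:i] + attach(r[3], st[1:], N) + form[i+1:]  # type 3: pop f

    # One application of the push rule S[s] -> S[fs] to the start form S[]:
    print(list(steps([("S", "")], [("push", "S", "f", "S")], {"S"})))   # [[('S', 'f')]]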

Original definition by Aho

Historically, the concept of indexed grammars was first introduced by Alfred Aho (1968) [3] using a different formalism. Aho defined an indexed grammar to be a 5-tuple (N,T,F,P,S) where

  1. N is a finite alphabet of variables or nonterminal symbols
  2. T is a finite alphabet of terminal symbols
  3. F is a finite set of so-called flags, each flag being itself a finite set of so-called index productions, i.e. a finite subset of N × (N ∪ T)*
  4. P ⊆ N × (NF* ∪ T)* is the finite set of productions
  5. S ∈ N is the start symbol

Direct derivations again copy the flag string of a rewritten nonterminal to the nonterminals of its right-hand side: ordinary productions from P may additionally prepend new flags to these copies, while applying an index production taken from the topmost flag consumes that flag. This formalism is used, e.g., by Hayashi (1973, pp. 65–66). [4]

Examples

In practice, stacks of indices can count and remember what rules were applied and in which order. For example, indexed grammars can describe the context-sensitive language of word triples { www : w ∈ {a,b}* }:

S[σ] → S[fσ]             T[fσ] → aT[σ]
S[σ] → S[gσ]             T[gσ] → bT[σ]
S[σ] → T[σ] T[σ] T[σ]    T[] → ε

A derivation of abbabbabb is then

S[] ⇒ S[g] ⇒ S[gg] ⇒ S[fgg] ⇒ T[fgg] T[fgg] T[fgg] ⇒ aT[gg] T[fgg] T[fgg] ⇒ abT[g] T[fgg] T[fgg] ⇒ abbT[] T[fgg] T[fgg] ⇒ abbT[fgg] T[fgg] ⇒ ... ⇒ abbabbT[fgg] ⇒ ... ⇒ abbabbabb.
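
That the grammar produces only such triples can be checked by brute force; the following self-contained Python sketch (the tuple encoding and the search bounds are assumptions of this sketch) enumerates sentential forms breadth-first and verifies that every derivable terminal string has the form www:

    from collections import deque

    def expand(form):
        """One derivation step anywhere in `form`, a tuple whose items are
        terminal characters or (nonterminal, stack) pairs."""
        for i, it in enumerate(form):
            if isinstance(it, str):
                continue
            name, st = it
            if name == "S":
                yield form[:i] + (("S", "f" + st),) + form[i+1:]     # S[s] -> S[fs]
                yield form[:i] + (("S", "g" + st),) + form[i+1:]     # S[s] -> S[gs]
                yield form[:i] + (("T", st),) * 3 + form[i+1:]       # S[s] -> T[s]T[s]T[s]
            elif st[:1] == "f":
                yield form[:i] + ("a", ("T", st[1:])) + form[i+1:]   # T[fs] -> aT[s]
            elif st[:1] == "g":
                yield form[:i] + ("b", ("T", st[1:])) + form[i+1:]   # T[gs] -> bT[s]
            else:
                yield form[:i] + form[i+1:]                          # T[] -> eps

    def small(form):   # crude search bound: at most 9 symbols, stacks of length <= 3
        return len(form) <= 9 and all(len(x[1]) <= 3 for x in form if isinstance(x, tuple))

    words, seen = set(), {(("S", ""),)}
    queue = deque(seen)
    while queue:
        form = queue.popleft()
        if all(isinstance(x, str) for x in form):
            words.add("".join(form))
            continue
        for nxt in expand(form):
            if small(nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

    assert all(w == w[:len(w) // 3] * 3 for w in words)   # every word has the form www
    assert "abbabbabb" in words                           # the derivation shown above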

As another example, the grammar G = ⟨ {S,T,A,B,C}, {a,b,c}, {f,g}, P, S ⟩ produces the language { aⁿbⁿcⁿ : n ≥ 1 }, where the production set P consists of

S[σ] → T[gσ]             A[fσ] → aA[σ]    A[gσ] → a
T[σ] → T[fσ]             B[fσ] → bB[σ]    B[gσ] → b
T[σ] → A[σ] B[σ] C[σ]    C[fσ] → cC[σ]    C[gσ] → c

An example derivation is

S[] ⇒ T[g] ⇒ T[fg] ⇒ A[fg] B[fg] C[fg] ⇒ aA[g] B[fg] C[fg] ⇒ aA[g] bB[g] C[fg] ⇒ aA[g] bB[g] cC[g] ⇒ aabB[g] cC[g] ⇒ aabbcC[g] ⇒ aabbcc.
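
The following small Python sketch (illustrative only) mimics this derivation strategy and shows how the stack f...fg, built up under S and T, is consumed once by each of A, B, and C:

    def derive(n):
        stack = "f" * (n - 1) + "g"     # built via S[s] -> T[gs], then T[s] -> T[fs]
        def expand(letter, s):
            out = ""
            while s[0] == "f":          # A[fs] -> a A[s]   (likewise for B, C)
                out, s = out + letter, s[1:]
            return out + letter         # A[gs] -> a        (likewise for B, C)
        return "".join(expand(c, stack) for c in "abc")

    assert derive(1) == "abc" and derive(3) == "aaabbbccc"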

Neither example language is context-free, as can be shown with the pumping lemma for context-free languages.

Properties

Hopcroft and Ullman tend to consider indexed languages as a "natural" class, since they are generated by several formalisms other than indexed grammars, viz. [5]

  1. Aho's one-way nested stack automata [6]
  2. Fischer's macro grammars [7]
  3. Greibach's nested iterated substitutions [8]
  4. Maibaum's generalized algebraic approach [9]

Hayashi [4] generalized the pumping lemma to indexed grammars. Conversely, Gilman [10] [11] gives a "shrinking lemma" for indexed languages.

Linear indexed grammars

Gerald Gazdar has defined a second class, the linear indexed grammars (LIG), [14] by requiring that at most one nonterminal in each production be specified as receiving the stack, [note 2] whereas in an ordinary indexed grammar all nonterminals receive copies of the stack. Formally, a linear indexed grammar is defined similarly to an ordinary indexed grammar, but the form requirements on its productions are modified to:

  1. A[σ] → α[] B[σ] β[],
  2. A[σ] → α[] B[fσ] β[], or
  3. A[fσ] → α[] B[σ] β[],

where A, B ∈ N, f ∈ F, and σ ∈ F* are used as above, and β ∈ (N ∪ T)* is a string of nonterminal and terminal symbols like α. [note 3] The direct derivation relation ⇒ is defined analogously to the one above. This new class of grammars defines a strictly smaller class of languages, [15] which belongs to the mildly context-sensitive classes.
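
Since the only difference to an ordinary indexed grammar is where the stack goes, a single LIG rewrite is easy to state in code. In the following Python sketch (the rule encoding is an assumption of this sketch), only the designated child B receives the stack, while every other symbol gets an empty one:

    def apply_lig(stack, rule):
        """rule = (op, alpha, b, beta) encodes A[..] -> alpha[] B[..] beta[];
        op is ("copy",), ("push", f), or ("pop", f).  For brevity, terminals
        in alpha/beta are carried along with a dummy empty stack too."""
        op, alpha, b, beta = rule
        if op[0] == "copy":
            child = (b, stack)                    # B inherits the whole stack
        elif op[0] == "push":
            child = (b, op[1] + stack)            # B gets f on top of the stack
        else:
            assert stack.startswith(op[1])        # pop applies only if f is on top
            child = (b, stack[1:])
        return [(x, "") for x in alpha] + [child] + [(x, "") for x in beta]

    # S[g] rewritten by S[s] -> a S[fs] c:
    print(apply_lig("g", (("push", "f"), "a", "S", "c")))   # [('a', ''), ('S', 'fg'), ('c', '')]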

The language { www : w ∈ {a,b}* } is generable by an indexed grammar, but not by a linear indexed grammar, while both { ww : w ∈ {a,b}* } and { aⁿbⁿcⁿ : n ≥ 1 } are generable by a linear indexed grammar.

If both the original and the modified production rules are admitted, the language class remains the indexed languages. [16]

Example

Letting σ denote an arbitrary sequence of stack symbols, we can define a grammar for the language L = { aⁿbⁿcⁿ : n ≥ 1 } [note 4] as

S[σ] → a S[fσ] c
S[σ] → T[σ]
T[fσ] → T[σ] b
T[] → ε

To derive the string abc we have the steps:

S[] ⇒ aS[f]c ⇒ aT[f]c ⇒ aT[]bc ⇒ abc

Similarly:

S[] ⇒ aS[f]c ⇒ aaS[ff]cc ⇒ aaT[ff]cc ⇒ aaT[f]bcc ⇒ aaT[]bbcc ⇒ aabbcc
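
The derivation is mechanical enough to replay programmatically; the following Python sketch (illustrative only) prints the sentential forms for an arbitrary n:

    def derivation(n):
        """Replay S[] =>* a^n b^n c^n in the LIG above, collecting each form."""
        forms, stack = ["S[]"], ""
        for i in range(1, n + 1):                       # S[s] -> a S[fs] c, n times
            stack = "f" + stack
            forms.append("a" * i + "S[" + stack + "]" + "c" * i)
        forms.append("a" * n + "T[" + stack + "]" + "c" * n)    # S[s] -> T[s]
        bs = ""
        while stack:                                    # T[fs] -> T[s] b
            stack, bs = stack[1:], bs + "b"
            forms.append("a" * n + "T[" + stack + "]" + bs + "c" * n)
        forms.append("a" * n + bs + "c" * n)            # T[] -> eps
        return forms

    print(" ⇒ ".join(derivation(2)))
    # S[] ⇒ aS[f]c ⇒ aaS[ff]cc ⇒ aaT[ff]cc ⇒ aaT[f]bcc ⇒ aaT[]bbcc ⇒ aabbcc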

Computational power

The linearly indexed languages are a subset of the indexed languages, and thus all LIGs can be recoded as IGs, making the LIGs strictly less powerful than the IGs. A conversion from a LIG to an IG is relatively simple. [17] LIG rules in general look approximately like A[σ] → α[] B[σ] β[], modulo the push/pop part of a rewrite rule. The symbols α and β represent strings of terminal and/or nonterminal symbols, and any nonterminal symbol in either must have an empty stack, by the definition of a LIG. This is, of course, counter to how IGs are defined: in an IG, the nonterminals whose stacks are not being pushed to or popped from must have exactly the same stack as the rewritten nonterminal. Thus, somehow, we need to have nonterminals in α and β which, despite having non-empty stacks, behave as if they had empty stacks.

Consider a rule of the form X[σ] → Y[] Z[σ] as an example case. In converting this to an IG, the replacement for Y[] must be some Y′[σ] that behaves exactly like Y[] regardless of what σ is. To achieve this, we can simply have a pair of rules that takes any Y′[σ] where σ is not empty and pops one symbol from the stack, so that, once the stack is empty, Y′[] can be rewritten as Y[]:

Y′[fσ] → Y′[σ]    (for every index f ∈ F)
Y′[] → Y[]

We can apply this in general to derive an IG from a LIG: every nonterminal that occurs with an empty stack on the right-hand side of some rule is replaced by its primed counterpart, which then receives a copy of the stack like any other IG nonterminal, and the popping rules above are added for each primed nonterminal. Each rule of the resulting grammar fits the definition of an IG, in which all the nonterminals in the right-hand side of a rewrite rule receive a copy of the rewritten symbol's stack. The indexed grammars are therefore able to describe all the languages that linear indexed grammars can describe.
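
The conversion itself can be phrased as a small program. The following Python sketch (the string encoding of rules, the two-letter index alphabet, and the restriction to right-hand-side occurrences are assumptions of this sketch) applies it to the example rule from above:

    INDICES = ["f", "g"]                                   # assumed index alphabet

    def lig_to_ig(rules, empty_nts):
        """rules: LIG rules as plain strings such as "X[s] -> Y[] Z[s]";
        empty_nts: nonterminals that occur as Y[] on right-hand sides
        (assumed here not to occur as Y[] on any left-hand side)."""
        out = list(rules)
        for y in empty_nts:
            out = [r.replace(y + "[]", y + "'[s]") for r in out]    # Y[] becomes Y'[s]
            for f in INDICES:
                out.append(y + "'[" + f + "s] -> " + y + "'[s]")    # Y' pops one index
            out.append(y + "'[] -> " + y + "[]")                    # empty: act as Y[]
        return out

    print(lig_to_ig(["X[s] -> Y[] Z[s]"], ["Y"]))
    # ["X[s] -> Y'[s] Z[s]", "Y'[fs] -> Y'[s]", "Y'[gs] -> Y'[s]", "Y'[] -> Y[]"]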

Relation to other formalisms

Vijay-Shanker and Weir (1994) [18] demonstrate that linear indexed grammars, combinatory categorial grammars, tree-adjoining grammars, and head grammars all define the same class of string languages. Their formal definition of linear indexed grammars [19] differs from the one given above.

LIGs (and the formalisms weakly equivalent to them) are strictly less expressive than another family of weakly equivalent formalisms comprising LCFRS, MCTAG, MCFG, and minimalist grammars (MGs); that is, the languages generated by LIGs form a proper subset of those generated by the latter family. Like LIGs, the latter family can be parsed in polynomial time. [20]

Distributed index grammars

Another form of indexed grammars, introduced by Staudacher (1993), [12] is the class of Distributed Index grammars (DIGs). What distinguishes DIGs from Aho's indexed grammars is the propagation of indexes. Unlike Aho's IGs, which distribute the whole symbol stack to all nonterminals during a rewrite operation, DIGs divide the stack into substacks and distribute the substacks to selected nonterminals.

The general rule schema for a binarily distributing rule of a DIG is of the form

X[f₁...fᵢfᵢ₊₁...fₙ] → α Y[f₁...fᵢ] β Z[fᵢ₊₁...fₙ] γ

where α, β, and γ are arbitrary terminal strings. For a ternarily distributing rule:

X[f₁...fᵢfᵢ₊₁...fⱼfⱼ₊₁...fₙ] → α Y[f₁...fᵢ] β Z[fᵢ₊₁...fⱼ] γ W[fⱼ₊₁...fₙ] η

And so forth for higher numbers of nonterminals on the right-hand side of the rewrite rule. In general, if there are m nonterminals on the right-hand side of a rewrite rule, the stack is partitioned m ways and distributed amongst the new nonterminals, as sketched below. Notice that there is a special case where a substack is empty, which effectively makes the rule a LIG rule; the distributed index languages are therefore a superset of the linearly indexed languages.
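
The distribution step is just an enumeration of ordered stack partitions; the following Python sketch (illustrative only) lists all m-way splits of a stack, including the empty-substack cases that make a rule behave like a LIG rule:

    from itertools import combinations_with_replacement

    def distributions(stack, m):
        """All ways to split `stack` into m ordered, possibly empty substacks."""
        n = len(stack)
        for cuts in combinations_with_replacement(range(n + 1), m - 1):
            bounds = (0,) + cuts + (n,)
            yield [stack[bounds[i]:bounds[i + 1]] for i in range(m)]

    for parts in distributions("fg", 2):
        print(parts)              # ['', 'fg'], then ['f', 'g'], then ['fg', '']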

Notes

  1. "[" and "]" are meta symbols to indicate the stack.
  2. all other nonterminals receive an empty stack
  3. In order to generate any string at all, some productions must be admitted that have no nonterminal symbol on their right-hand side. However, Gazdar did not discuss this issue.
  4. Cf. the properly indexed grammar for the same language given above. The last rule, viz. T[]→ε, of the linear indexed grammar does not conform to Gazdar's definition in a strict sense; cf. [note 3]


References

  1. Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. ISBN 978-0-201-02988-8.
  2. Hopcroft and Ullman (1979), [1] Sect. 14.3, pp. 389–390. This section is omitted in the 2nd edition (2003).
  3. Aho, Alfred (1968). "Indexed grammars—an extension of context-free grammars". Journal of the ACM. 15 (4): 647–671. doi:10.1145/321479.321488. S2CID 9539666.
  4. Hayashi, Takeshi (1973). "On derivation trees of indexed grammars: An extension of the uvwxy-theorem". Publications of the Research Institute for Mathematical Sciences. 9: 61–92. doi:10.2977/prims/1195192738.
  5. Hopcroft and Ullman (1979), [1] Bibliographic notes, pp. 394–395.
  6. Aho, Alfred (1969). "Nested Stack Automata". Journal of the ACM. 16 (3): 383–406. doi:10.1145/321526.321529. S2CID 685569.
  7. Fischer, Michael J. (1968). "Grammars with Macro-Like Productions". Proc. 9th Ann. IEEE Symp. on Switching and Automata Theory (SWAT). pp. 131–142. doi:10.1109/SWAT.1968.12.
  8. Greibach, Sheila A. (1970). "Full AFL's and Nested Iterated Substitution". Information and Control. 16 (1): 7–35. doi:10.1016/s0019-9958(70)80039-0.
  9. Maibaum, T. S. E. (1974). "A Generalized Approach to Formal Languages". Journal of Computer and System Sciences. 8 (3): 409–439. doi:10.1016/s0022-0000(74)80031-0.
  10. Gilman, Robert H. (1996). "A Shrinking Lemma for Indexed Languages". Theoretical Computer Science. 163 (1–2): 277–281. arXiv:math/9509205. doi:10.1016/0304-3975(96)00244-7. S2CID 14479068.
  11. Gilman, Robert H. (Sep 1995). "A Shrinking Lemma for Indexed Languages". arXiv:math/9509205.
  12. Staudacher, Peter (1993). "New frontiers beyond context-freeness: DI-grammars (DIGs) and DI-automata" (PDF). Sixth Conference of the European Chapter of the Association for Computational Linguistics (EACL '93). pp. 358–367.
  13. Weir, David J.; Joshi, Aravind K. (1988). "Combinatory Categorial Grammars: Generative Power and Relationship to Linear Context-Free Rewriting Systems" (PDF). Proc. 26th Meeting Assoc. Comput. Ling. pp. 278–285.
  14. According to Staudacher (1993, p. 361 left, Sect. 2.2), [12] the name "linear indexed grammars" was not used in Gazdar's 1988 paper, but appeared later, e.g. in Weir and Joshi (1988). [13]
  15. Gazdar, Gerald (1988). "Applicability of Indexed Grammars to Natural Languages". In U. Reyle and C. Rohrer (eds.). Natural Language Parsing and Linguistic Theories. Studies in Linguistics and Philosophy. Vol. 35. D. Reidel Publishing Company. pp. 69–94. ISBN 978-1-55608-055-5.
  16. Gazdar (1988), Appendix, p. 89.
  17. Gazdar (1988), Appendix, pp. 89–91.
  18. Vijay-Shanker, K.; Weir, David J. (1994). "The Equivalence of Four Extensions of Context-Free Grammars". Mathematical Systems Theory. 27 (6): 511–546. doi:10.1007/bf01191624. S2CID 12336597.
  19. Vijay-Shanker and Weir (1994), [18] pp. 517–518.
  20. van Benthem, Johan F. A. K.; ter Meulen, Alice (2010). Handbook of Logic and Language (2nd ed.). Elsevier. p. 404. ISBN 978-0-444-53727-0.