Conjunctive grammar

Last updated

Conjunctive grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with a conjunction operation. Besides explicit conjunction, conjunctive grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction can be used, in particular, to specify intersection of languages. A further extension of conjunctive grammars known as Boolean grammars additionally allows explicit negation.

Contents

The rules of a conjunctive grammar are of the form

where is a nonterminal and , ..., are strings formed of symbols in and (finite sets of terminal and nonterminal symbols respectively). Informally, such a rule asserts that every string over that satisfies each of the syntactical conditions represented by , ..., therefore satisfies the condition defined by .

Formal definition

A conjunctive grammar is defined by the 4-tuple where

  1. V is a finite set; each element is called a nonterminal symbol or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories.
  2. Σ is a finite set of terminals, disjoint from V, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar G.
  3. R is a finite set of productions, each of the form for some in and . The members of R are called the rules or productions of the grammar.
  4. S is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of V.

It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. Rules and can hence be written as .

Two equivalent formal definitions of the language specified by a conjunctive grammar exist. One definition is based upon representing the grammar as a system of language equations with union, intersection and concatenation and considering its least solution. The other definition generalizes Chomsky's generative definition of the context-free grammars using rewriting of terms over conjunction and concatenation.

Definition by derivation

For any strings , we say u directly yields v, written as , if

For any string we say Ggeneratesw, written as , if such that .

The language of a grammar is the set of all strings it generates.

Example

The grammar , with productions

,
,
,
,
,

is conjunctive. A typical derivation is

It can be shown that . The language is not context-free, proved by the pumping lemma for context-free languages.

Parsing algorithms

Though the expressive power of conjunctive grammars is greater than those of context-free grammars, conjunctive grammars retain some of the latter. Most importantly, there are generalizations of the main context-free parsing algorithms, including the linear-time recursive descent, the cubic-time generalized LR, the cubic-time Cocke-Kasami-Younger, as well as Valiant's algorithm running as fast as matrix multiplication.

Theoretical properties

A property that is undecidable already for context-free languages or finite intersections of them, must be undecidable also for conjunctive grammars; these include: emptiness, finiteness, regularity, context-freeness, [n 1] inclusion and equivalence. [n 2]

The family of conjunctive languages is closed under union, intersection, concatenation and Kleene star, but not under string homomorphism, prefix, suffix, and substring. Closure under complement and under ε-free string homomorphism are still open problems (as of 2001). [1] :533

The expressive power of grammars over a one-letter alphabet has been researched.[ citation needed ]

This work provided a basis for the study of language equations of a more general form.

Synchronized alternating pushdown automata

Aizikowitz and Kaminski [2] introduced a new class of pushdown automata (PDA) called synchronized alternating pushdown automata (SAPDA). They proved it to be equivalent to conjunctive grammars in the same way as nondeterministic PDAs are equivalent to context-free grammars.

Notes

  1. Given a conjunctive grammar, is its generated language empty / finite / regular / context-free?
  2. Given two conjunctive grammars, is the first's generated language a subset of / equal to the second's?

Related Research Articles

<span class="mw-page-title-main">Chomsky hierarchy</span> Hierarchy of classes of formal grammars

The Chomsky hierarchy in the fields of formal language theory, computer science, and linguistics, is a containment hierarchy of classes of formal grammars. A formal grammar describes how to form strings from a language's vocabulary that are valid according to the language's syntax. Linguist Noam Chomsky theorized that four different classes of formal grammars existed that could generate increasingly complex languages. Each class can also completely generate the language of all inferior classes.

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

<span class="mw-page-title-main">Pushdown automaton</span> Type of automaton

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language and a suffix. For instance, can be recognized as a sum because it can be broken into , also a sum, and , a suitable suffix.

In the theory of formal languages of computer science, mathematics, and linguistics, a Dyck word is a balanced string of brackets. The set of Dyck words forms a Dyck language. The simplest, D1, uses just two matching brackets, e.g..

Boolean grammars, introduced by Okhotin, are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with conjunction and negation operations. Besides these explicit operations, Boolean grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction and negation can be used, in particular, to specify intersection and complement of languages. An intermediate class of grammars known as conjunctive grammars allows conjunction and disjunction, but not negation.

A matrix grammar is a formal grammar in which instead of single productions, productions are grouped together into finite sequences. A production cannot be applied separately, it must be applied in sequence. In the application of such a sequence of productions, the rewriting is done in accordance to each production in sequence, the first one, second one etc. till the last production has been used for rewriting. The sequences are referred to as matrices.

<span class="mw-page-title-main">Terminal and nonterminal symbols</span> Categories of symbols in formal grammars

In formal languages, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined as part of a formal grammar. Nonterminal symbols are replaced by groups of terminal symbols according to the production rules.

Language equations are mathematical statements that resemble numerical equations, but the variables assume values of formal languages rather than numbers. Instead of arithmetic operations in numerical equations, the variables are joined by language operations. Among the most common operations on two languages A and B are the set union AB, the set intersection AB, and the concatenation AB. Finally, as an operation taking a single operand, the set A* denotes the Kleene star of the language A. Therefore, language equations can be used to represent formal grammars, since the languages generated by the grammar must be the solution of a system of language equations.

Interaction nets are a graphical model of computation devised by Yves Lafont in 1990 as a generalisation of the proof structures of linear logic. An interaction net system is specified by a set of agent types and a set of interaction rules. Interaction nets are an inherently distributed model of computation in the sense that computations can take place simultaneously in many parts of an interaction net, and no synchronisation is needed. The latter is guaranteed by the strong confluence property of reduction in this model of computation. Thus interaction nets provide a natural language for massive parallelism. Interaction nets are at the heart of many implementations of the lambda calculus, such as efficient closed reduction and optimal, in Lévy's sense, Lambdascope.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

A production or production rule in computer science is a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences. A finite set of productions is the main component in the specification of a formal grammar. The other components are a finite set of nonterminal symbols, a finite set of terminal symbols that is disjoint from and a distinguished symbol that is the start symbol.

An embedded pushdown automaton or EPDA is a computational model for parsing languages generated by tree-adjoining grammars (TAGs). It is similar to the context-free grammar-parsing pushdown automaton, but instead of using a plain stack to store symbols, it has a stack of iterated stacks that store symbols, giving TAGs a generative capacity between context-free and context-sensitive grammars, or a subset of mildly context-sensitive grammars. Embedded pushdown automata should not be confused with nested stack automata which have more computational power.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

Range concatenation grammar (RCG) is a grammar formalism developed by Pierre Boullier in 1998 as an attempt to characterize a number of phenomena of natural language, such as Chinese numbers and German word order scrambling, which are outside the bounds of the mildly context-sensitive languages.

Parikh's theorem in theoretical computer science says that if one looks only at the number of occurrences of each terminal symbol in a context-free language, without regard to their order, then the language is indistinguishable from a regular language. It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar. It was first proved by Rohit Parikh in 1961 and republished in 1966.

<span class="mw-page-title-main">LL grammar</span> Type of a context-free grammar

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

References

  1. Alexander Okhotin (2001). "Conjunctive Grammars" (PDF). Journal of Automata, Languages and Combinatorics. 6 (4): 519–535.
  2. Aizikowitz, Tamar; Kaminski, Michael (2011). "LR(0) Conjunctive Grammars and Deterministic Synchronized Alternating Pushdown Automata". Computer Science – Theory and Applications. Lecture Notes in Computer Science. Vol. 6651. pp. 345–358. doi:10.1007/978-3-642-20712-9_27. ISBN   978-3-642-20711-2. ISSN   0302-9743.