Regular grammar

Last updated

In theoretical computer science and formal language theory, a regular grammar is a grammar that is right-regular or left-regular. While their exact definition varies from textbook to textbook, they all require that

Contents

Every regular grammar describes a regular language.

Strictly regular grammars

A right-regular grammar (also called right-linear grammar) is a formal grammar (N, Σ, P, S) in which all production rules in P are of one of the following forms:

  1. Aa
  2. AaB
  3. A → ε

where A, B, SN are non-terminal symbols, a ∈ Σ is a terminal symbol, and ε denotes the empty string, i.e. the string of length 0. S is called the start symbol.

In a left-regular grammar, (also called left-linear grammar), all rules obey the forms

  1. Aa
  2. ABa
  3. A → ε

The language described by a given grammar is the set of all strings that contain only terminal symbols and can be derived from the start symbol by repeated application of production rules. Two grammars are called weakly equivalent if they describe the same language.

Rules of both kinds must not be mixed; for example, the grammar with rule set { SaT, TSb, S→ε } is not regular, and describes the language , which is not regular, either.

Some textbooks and articles disallow empty production rules, and assume that the empty string is not present in languages.

Extended regular grammars

An extended right-regular grammar is one in which all rules obey one of

  1. Aw, where A is a non-terminal in N and w is in a (possibly empty) string of terminals Σ*
  2. AwB, where A and B are in N and w is in Σ*.

Some authors call this type of grammar a right-regular grammar (or right-linear grammar) [1] and the type above a strictly right-regular grammar (or strictly right-linear grammar). [2]

An extended left-regular grammar is one in which all rules obey one of

  1. Aw, where A is a non-terminal in N and w is in Σ*
  2. ABw, where A and B are in N and w is in Σ*.

Examples

An example of a right-regular grammar G with N = {S, A}, Σ = {a, b, c}, P consists of the following rules

S → aS
S → bA
A → ε
A → cA

and S is the start symbol. This grammar describes the same language as the regular expression a*bc*, viz. the set of all strings consisting of arbitrarily many "a"s, followed by a single "b", followed by arbitrarily many "c"s.

A somewhat longer but more explicit extended right-regular grammar G for the same regular expression is given by N = {S, A, B, C}, Σ = {a, b, c}, where P consists of the following rules:

S → A
A → aA
A → B
B → bC
C → ε
C → cC

...where each uppercase letter corresponds to phrases starting at the next position in the regular expression.

As an example from the area of programming languages, the set of all strings denoting a floating point number can be described by an extended right-regular grammar G with N = {S,A,B,C,D,E,F}, Σ = {0,1,2,3,4,5,6,7,8,9,+,−,.,e}, where S is the start symbol, and P consists of the following rules:

S → +A    A → 0A    B → 0C    C → 0C    D → +E    E → 0F    F → 0F
S → −AA → 1AB → 1CC → 1CD → −EE → 1FF → 1F
S → AA → 2AB → 2CC → 2CD → EE → 2FF → 2F
A → 3AB → 3CC → 3CE → 3FF → 3F
A → 4AB → 4CC → 4CE → 4FF → 4F
A → 5AB → 5CC → 5CE → 5FF → 5F
A → 6AB → 6CC → 6CE → 6FF → 6F
A → 7AB → 7CC → 7CE → 7FF → 7F
A → 8AB → 8CC → 8CE → 8FF → 8F
A → 9AB → 9CC → 9CE → 9FF → 9F
A → .BC → eDF → ε
A → BC → ε

Expressive power

There is a direct one-to-one correspondence between the rules of a (strictly) right-regular grammar and those of a nondeterministic finite automaton, such that the grammar generates exactly the language the automaton accepts. [3] Hence, the right-regular grammars generate exactly all regular languages. The left-regular grammars describe the reverses of all such languages, that is, exactly the regular languages as well.

Every strict right-regular grammar is extended right-regular, while every extended right-regular grammar can be made strict by inserting new non-terminals, such that the result generates the same language; hence, extended right-regular grammars generate the regular languages as well. Analogously, so do the extended left-regular grammars.

If empty productions are disallowed, only all regular languages that do not include the empty string can be generated. [4]

While regular grammars can only describe regular languages, the converse is not true: regular languages can also be described by non-regular grammars.

Mixing left-regular and right-regular rules

If mixing of left-regular and right-regular rules is allowed, we still have a linear grammar, but not necessarily a regular one. What is more, such a grammar need not generate a regular language: all linear grammars can be easily brought into this form, and hence, such grammars can generate exactly all linear languages, including non-regular ones.

For instance, the grammar G with N = {S, A}, Σ = {a, b}, P with start symbol S and rules

S → aA
A → Sb
S → ε

generates , the paradigmatic non-regular linear language.

See also

Related Research Articles

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free language (CFL), also called a Chomsky type-2 language, is a language generated by a context-free grammar (CFG).

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form if all of its production rules are of the form:

<span class="mw-page-title-main">Formal language</span> Sequence of words formed by specific rules

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.

<span class="mw-page-title-main">Pushdown automaton</span> Type of automaton

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree. Every non-empty context-free language admits an ambiguous grammar by introducing e.g. a duplicate rule. A language that only admits ambiguous grammars is called an inherently ambiguous language. Deterministic context-free grammars are always unambiguous, and are an important subclass of unambiguous grammars; there are non-deterministic unambiguous grammars, however.

Top-Down Parsing Language (TDPL) is a type of analytic formal grammar developed by Alexander Birman in the early 1970s in order to study formally the behavior of a common class of practical top-down parsers that support a limited form of backtracking. Birman originally named his formalism the TMG Schema (TS), after TMG, an early parser generator, but it was later given the name TDPL by Aho and Ullman in their classic anthology The Theory of Parsing, Translation and Compiling.

In computer science, a linear grammar is a context-free grammar that has at most one nonterminal in the right-hand side of each of its productions.

Conjunctive grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with a conjunction operation. Besides explicit conjunction, conjunctive grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction can be used, in particular, to specify intersection of languages. A further extension of conjunctive grammars known as Boolean grammars additionally allows explicit negation.

In formal language theory, an alphabet, sometimes called a vocabulary, is a non-empty set of indivisible symbols/glyphs, typically thought of as representing letters, characters, digits, phonemes, or even words. Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and, depending on its purpose, may be finite, countable, or even uncountable.

In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used for computer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.

In formal language theory, a grammar is noncontracting if for all of its production rules, α → β, it holds that |α| ≤ |β|, that is β has at least as many symbols as α. A grammar is essentially noncontracting if there may be one exception, namely, a rule S → ε where S is the start symbol and ε the empty string, and furthermore, S never occurs in the right-hand side of any rule.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

<span class="mw-page-title-main">LL grammar</span> Type of a context-free grammar

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

In theoretical computer science, in particular in formal language theory, Greibach's theorem states that certain properties of formal language classes are undecidable. It is named after the computer scientist Sheila Greibach, who first proved it in 1963.

References

  1. John E. Hopcroft and Jeffrey D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation . Reading/MA: Addison-Wesley. ISBN   0-201-02988-X. Here: p.217 (left, right-regular grammars as subclasses of context-free grammars), p.79 (context-free grammars)
  2. Hopcroft and Ullman 1979 (p.229, exercise 9.2) call it a normal form for right-linear grammars.
  3. Hopcroft and Ullman 1979, p.218-219, Theorem 9.1 and 9.2
  4. Hopcroft and Ullman 1979, p.229, Exercise 9.2