Greibach's theorem

Last updated August 19, 2023

In theoretical computer science, in particular in formal language theory, Greibach's theorem states that certain properties of formal language classes are undecidable. It is named after the computer scientist Sheila Greibach, who first proved it in 1963.^[1]^[2]

Definitions

Given a set Σ, often called "alphabet", the (infinite) set of all strings built from members of Σ is denoted by Σ^*. A formal language is a subset of Σ^*. If L₁ and L₂ are formal languages, their product L₁L₂ is defined as the set { w₁w₂ : w₁ ∈ L₁, w₂ ∈ L₂ } of all concatenations of a string w₁ from L₁ with a string w₂ from L₂. If L is a formal language and a is a symbol from Σ, their quotient L/a is defined as the set { w : wa ∈ L } of all strings that can be made members of L by appending an a. Various approaches are known from formal language theory to denote a formal language by a finite description, such as a formal grammar or a finite-state machine.

For example, using an alphabet Σ = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }, the set Σ^* consists of all (decimal representations of) natural numbers, with leading zeroes allowed, and the empty string, denoted as ε. The set L_div3 of all naturals divisible by 3 is an infinite formal language over Σ; it can be finitely described by the following regular grammar with start symbol S₀:

S₀ →	ε \|	0 S₀	\| 1 S₂	\| 2 S₁	\| 3 S₀	\| 4 S₂	\| 5 S₁	\| 6 S₀	\| 7 S₂	\| 8 S₁	\| 9 S₀
S₁ →		0 S₁	\| 1 S₀	\| 2 S₂	\| 3 S₁	\| 4 S₀	\| 5 S₂	\| 6 S₁	\| 7 S₀	\| 8 S₂	\| 9 S₁
S₂ →		0 S₂	\| 1 S₁	\| 2 S₀	\| 3 S₂	\| 4 S₁	\| 5 S₀	\| 6 S₂	\| 7 S₁	\| 8 S₀	\| 9 S₂

Examples for finite languages are {ε,1,2} and {0,2,4,6,8}; their product {ε,1,2}{0,2,4,6,8} yields the even numbers up to 28. The quotient of the set of prime numbers up to 100 by the symbol 7, 4, and 2 yields the language {ε,1,3,4,6,9}, {}, and {ε}, respectively.

Formal statement of the theorem

Greibach's theorem is independent of a particular approach to describe a formal language. It just considers a set C of formal languages over an alphabet Σ∪{#} such that

each language in C has a finite description,
each regular language over Σ∪{#} is in C,^{[note 1]}
given descriptions of languages L₁, L₂ ∈ C and of a regular language R ∈ C, a description of the products L₁R and RL₁, and of the union L₁∪L₂ can be effectively computed, and
it is undecidable for any member language L ∈ C with L ⊆ Σ^* whether L = Σ^*.

Let P be any nontrivial subset of C that contains all regular sets over Σ∪{#} and is closed under quotient by each single symbol in Σ∪{#}.^{[note 2]} Then the question whether L ∈ P for a given description of a language L ∈ C is undecidable.

Proof

Let M ⊆ Σ^*, such that M ∈ C, but M ∉ P.^{[note 3]} For any L ∈ C with L ⊆ Σ^*, define φ(L) = (M#Σ^*) ∪ (Σ^*#L). From a description of L, a description of φ(L) can be effectively computed.

Then L = Σ^* if and only if φ(L) ∈ P:

If L = Σ^*, then φ(L) = Σ^*#Σ^* is a regular language, and hence in P.
Else, some w ∈ Σ^* \ L exists, and the quotient φ(L)/(#w) equals M. Therefore, by repeated application of the quotient-closure property, φ(L) ∈ P would imply M = φ(L)/(#w) ∈ P, contradicting the definition of M.

Hence, if membership in P would be decidable for φ(L) from its description, so would be L’s equality to Σ^* from its description, which contradicts the definition of C. ^[3]

Applications

Using Greibach's theorem, it can be shown that the following problems are undecidable:

Given a context-free grammar, does it describe a regular language?

Proof: The class of context-free languages, and the set of regular languages, satisfies the above properties of C, and P, respectively.^{[note 4]}^[4]

Given a context-free language, is it inherently ambiguous?

Proof: The class of context-free languages, and the set of context-free languages that aren't inherently ambiguous, satisfies the above properties of C, and P, respectively.^[5]

Given a context-sensitive grammar, does it describe a context-free language?

Notes

↑ This is left implicit in Hopcroft, Ullman, 1979: P ⊆ C needs to contain all these regular languages.
↑ That is, if L ∈ P, then L/a ∈ P for each a ∈ Σ∪{#}.
↑ The existence of such an M is required by the above somewhat vague requirement of P being "nontrivial".
↑ Regular languages are context-free: Context-free grammar#Subclasses; context-free languages are closed with respect to union and (even general) concatenation: Context-free grammar#Closure properties; equality to Σ^* is undecidable for context-free languages: Context-free grammar#Universality; regular languages are closed under (even general) quotients: Regular language#Closure properties.

Related Research Articles

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG).

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

In theoretical computer science and formal language theory, a regular grammar is a grammar that is right-regular or left-regular. While their exact definition varies from textbook to textbook, they all require that

In computer science, an ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree. Every non-empty context-free language admits an ambiguous grammar by introducing e.g. a duplicate rule. A language that only admits ambiguous grammars is called an inherently ambiguous language. Deterministic context-free grammars are always unambiguous, and are an important subclass of unambiguous grammars; there are non-deterministic unambiguous grammars, however.

In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if

In computer science, a linear grammar is a context-free grammar that has at most one nonterminal in the right-hand side of each of its productions.

In automata theory, a deterministic pushdown automaton is a variation of the pushdown automaton. The class of deterministic pushdown automata accepts the deterministic context-free languages, a proper subset of context-free languages.

In formal language theory, an alphabet, sometimes called a vocabulary, is a non-empty set of indivisible symbols/glyphs, typically thought of as representing letters, characters, digits, phonemes, or even words. Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and depending on its purpose maybe be finite, countable, or even uncountable.

In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used for computer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.

In computer science, in particular in the field of formal language theory, an abstract family of languages is an abstract mathematical notion generalizing characteristics common to the regular languages, the context-free languages and the recursively enumerable languages, and other families of formal languages studied in the scientific literature.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

In formal language theory, a cone is a set of formal languages that has some desirable closure properties enjoyed by some well-known sets of languages, in particular by the families of regular languages, context-free languages and the recursively enumerable languages. The concept of a cone is a more abstract notion that subsumes all of these families. A similar notion is the faithful cone, having somewhat relaxed conditions. For example, the context-sensitive languages do not form a cone, but still have the required properties to form a faithful cone.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

In theoretical computer science, in particular in formal language theory, the Brzozowski derivative $of a set of strings and a string is the set of all strings obtainable from a string in by cutting off the prefix . Formally:$

In computer science, in particular in formal language theory, a quotient automaton can be obtained from a given nondeterministic finite automaton by joining some of its states. The quotient recognizes a superset of the given automaton; in some cases, handled by the Myhill–Nerode theorem, both languages are equal.

References

↑ Sheila Greibach (1963). "The undecidability of the ambiguity problem for minimal linear grammars". Information and Control. 6 (2): 117–125. doi:10.1016/s0019-9958(63)90149-9.
↑ Sheila Greibach (1968). "A note on undecidable properties of formal languages". Math Systems Theory. 2 (1): 1–6. doi:10.1007/bf01691341. S2CID 19948229.
↑ John E. Hopcroft; Jeffrey D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation . Addison-Wesley. ISBN 0-201-02988-X. p.205-206
↑ Hopcroft, Ullman, 1979, p.205, Theorem 8.15
↑ Hopcroft, Ullman, 1979, p.206, Theorem 8.16

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[3] This is left implicit in Hopcroft, Ullman, 1979: P ⊆ C needs to contain all these regular languages.

[4] That is, if L ∈ P, then L/a ∈ P for each a ∈ Σ∪{#}.

[5] The existence of such an M is required by the above somewhat vague requirement of P being "nontrivial".

[7] Regular languages are context-free: Context-free grammar#Subclasses; context-free languages are closed with respect to union and (even general) concatenation: Context-free grammar#Closure properties; equality to Σ^* is undecidable for context-free languages: Context-free grammar#Universality; regular languages are closed under (even general) quotients: Regular language#Closure properties.

[1] Sheila Greibach (1963). "The undecidability of the ambiguity problem for minimal linear grammars". Information and Control. 6 (2): 117–125. doi:10.1016/s0019-9958(63)90149-9.

[2] Sheila Greibach (1968). "A note on undecidable properties of formal languages". Math Systems Theory. 2 (1): 1–6. doi:10.1007/bf01691341. S2CID 19948229.

[6] John E. Hopcroft; Jeffrey D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation . Addison-Wesley. ISBN 0-201-02988-X. p.205-206

[8] Hopcroft, Ullman, 1979, p.205, Theorem 8.15

[9] Hopcroft, Ullman, 1979, p.206, Theorem 8.16

[1]

[2]

[note 1]

[note 2]

[note 3]

[3]

[note 4]

[4]

[5]