Star height problem

Last updated March 18, 2024

The star height problem in formal language theory is the question of whether all regular languages can be expressed using regular expressions of limited star height, i.e. with a limited nesting depth of Kleene stars. Specifically, is a nesting depth of one always sufficient? If not, is there an algorithm to determine the nesting depth required for a given regular language? The problem was first introduced by Eggan in 1963.^[1]

Families of regular languages with unbounded star height

The first question was answered in the negative when, in 1963, Eggan gave examples of regular languages of star height n for every n. Here, the star height h(L) of a regular language L is defined as the minimum star height among all regular expressions representing L. The first few languages found by Eggan are given below, each described by a regular expression:

$$\begin{alignedat}{2}e_{1}&=a_{1}^{*}\\e_{2}&=\left(a_{1}^{*}a_{2}^{*}a_{3}\right)^{*}\\e_{3}&=\left(\left(a_{1}^{*}a_{2}^{*}a_{3}\right)^{*}\left(a_{4}^{*}a_{5}^{*}a_{6}\right)^{*}a_{7}\right)^{*}\\e_{4}&=\left(\left(\left(a_{1}^{*}a_{2}^{*}a_{3}\right)^{*}\left(a_{4}^{*}a_{5}^{*}a_{6}\right)^{*}a_{7}\right)^{*}\left(\left(a_{8}^{*}a_{9}^{*}a_{10}\right)^{*}\left(a_{11}^{*}a_{12}^{*}a_{13}\right)^{*}a_{14}\right)^{*}a_{15}\right)^{*}\end{alignedat}$$

The construction principle for these expressions is that expression $e_{n+1}$ is obtained by concatenating two copies of $e_{n}$, renaming the letters of the second copy using fresh alphabet symbols, concatenating the result with another fresh alphabet symbol, and then surrounding the resulting expression with a Kleene star. The remaining, more difficult part is to prove that no regular expression equivalent to $e_{n}$ has star height less than n; a proof is given in Eggan (1963).
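The construction principle can be sketched in a few lines of Python. This is only an illustration: the class names `Sym`, `Cat`, `Star` and the helpers `eggan`, `star_height`, `letters` and `show` are hypothetical names chosen here, not taken from Eggan's paper. Note that `star_height` measures the nesting depth of one particular expression, which is only an upper bound on the star height of the language; the matching lower bound is the part that needs Eggan's proof.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Sym:
    index: int      # stands for the letter a_index


@dataclass(frozen=True)
class Cat:
    parts: tuple    # concatenation of sub-expressions, left to right


@dataclass(frozen=True)
class Star:
    body: object    # Kleene star applied to a sub-expression


def star_height(e):
    """Maximum nesting depth of Kleene stars in the expression e."""
    if isinstance(e, Sym):
        return 0
    if isinstance(e, Cat):
        return max(star_height(p) for p in e.parts)
    return 1 + star_height(e.body)


def letters(e):
    """Set of letter indices occurring in the expression e."""
    if isinstance(e, Sym):
        return {e.index}
    if isinstance(e, Cat):
        return set().union(*(letters(p) for p in e.parts))
    return letters(e.body)


def show(e):
    """Render e as text, writing a_i as 'ai'."""
    if isinstance(e, Sym):
        return f"a{e.index}"
    if isinstance(e, Cat):
        return "".join(show(p) for p in e.parts)
    inner = show(e.body)
    return inner + "*" if isinstance(e.body, Sym) else "(" + inner + ")*"


def eggan(n, fresh=None):
    """Build e_n; `fresh` is a shared counter that hands out unused letter indices."""
    fresh = fresh if fresh is not None else count(1)
    if n == 1:
        return Star(Sym(next(fresh)))
    first = eggan(n - 1, fresh)    # first copy of e_{n-1}
    second = eggan(n - 1, fresh)   # second copy, over fresh letters
    return Star(Cat((first, second, Sym(next(fresh)))))


if __name__ == "__main__":
    for n in range(1, 4):
        e = eggan(n)
        print(n, show(e), star_height(e), len(letters(e)))
```

Sharing one counter between the two recursive calls is what performs the renaming: the second copy simply continues numbering where the first copy stopped, so the two copies never share a letter.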

However, Eggan's examples use a large alphabet, of size $2^{n}-1$ for the language with star height n. He thus asked whether such examples also exist over binary alphabets. Dejean and Schützenberger proved in 1966 that they do.^[2] Their examples can be described by an inductively defined family of regular expressions over the binary alphabet $\{a,b\}$ as follows (cf. Salomaa (1981)):

$$\begin{alignedat}{2}e_{1}&=(ab)^{*}\\e_{2}&=\left(aa(ab)^{*}bb(ab)^{*}\right)^{*}\\e_{3}&=\left(aaaa\left(aa(ab)^{*}bb(ab)^{*}\right)^{*}bbbb\left(aa(ab)^{*}bb(ab)^{*}\right)^{*}\right)^{*}\\\,&\cdots \\e_{n+1}&=(\,\underbrace {a\cdots a} _{2^{n}}\,\cdot \,e_{n}\,\cdot \,\underbrace {b\cdots b} _{2^{n}}\,\cdot \,e_{n}\,)^{*}\end{alignedat}$$

Again, a rigorous proof is needed for the fact that $e_{n}$ does not admit an equivalent regular expression of lower star height. Proofs are given by Dejean & Schützenberger (1966) and by Salomaa (1981).
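The inductive definition of the binary family translates directly into code. The following Python sketch builds the expressions as strings and uses the standard `re` module to test membership of short words; the function names `dejean_schutzenberger`, `nesting_depth` and `accepts` are hypothetical helpers introduced for this illustration. Python's `re` is a backtracking engine with a richer syntax than textbook regular expressions, but these expressions use only concatenation, grouping and `*`, where the two notions agree. As with any single expression, the computed nesting depth is only an upper bound on the star height of the language.

```python
import re


def dejean_schutzenberger(n):
    """Return e_n over {a, b} as a string in ordinary regex syntax."""
    expr = "(ab)*"                      # e_1
    for k in range(1, n):               # build e_{k+1} from e_k
        block = 2 ** k                  # length of the a-run and of the b-run
        expr = "(" + "a" * block + expr + "b" * block + expr + ")*"
    return expr


def nesting_depth(expr):
    """Star nesting depth of an expression in which every star follows a ')'."""
    stack = [0]                         # deepest star nesting seen at each open level
    for i, ch in enumerate(expr):
        if ch == "(":
            stack.append(0)
        elif ch == ")":
            depth = stack.pop()
            if expr[i + 1:i + 2] == "*":
                depth += 1              # the closed group is itself starred
            stack[-1] = max(stack[-1], depth)
    return stack[0]


def accepts(n, word):
    """True if `word` belongs to the language described by e_n."""
    return re.fullmatch(dejean_schutzenberger(n), word) is not None


if __name__ == "__main__":
    for n in range(1, 4):
        e = dejean_schutzenberger(n)
        print(n, nesting_depth(e), e)
    print(accepts(2, "aaabbbab"), accepts(2, "ab"))
```

Because the run length doubles at each level, the expression for $e_{n}$ grows exponentially in n, so the sketch is only meant for small values.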

Computing the star height of regular languages

In contrast, the second question turned out to be much more difficult, and it remained a famous open problem in formal language theory for over two decades.^[3] For years, there was little progress. The pure-group languages were the first interesting family of regular languages for which the star height problem was proved to be decidable.^[4] The general problem remained open for more than 25 years until it was settled by Hashiguchi, who in 1988 published an algorithm to determine the star height of any regular language.^[5] The algorithm was not at all practical, being of non-elementary complexity. To illustrate the immense resource consumption of that algorithm, Lombardy & Sakarovitch (2002) give some actual numbers:

[The procedure described by Hashiguchi] leads to computations that are by far impossible, even for very small examples. For instance, if L is accepted by a 4 state automaton of loop complexity 3 (and with a small 10 element transition monoid), then a very low minorant of the number of languages to be tested with L for equality is: $\left(10^{10^{10}}\right)^{\left(10^{10^{10}}\right)^{\left(10^{10^{10}}\right)}}.$
— S. Lombardy and J. Sakarovitch, Star Height of Reversible Languages and Universal Automata, LATIN 2002

Note that the number $10^{10^{10}}$ alone has 10 billion zeros when written in decimal notation, and is already far larger than the number of atoms in the observable universe.
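The size comparison can be made explicit with a short calculation. The figure of roughly $10^{80}$ atoms in the observable universe used here is a commonly quoted order-of-magnitude estimate and an assumption of this illustration, not a number taken from the cited paper:

$$\begin{aligned}10^{10^{10}}&=1\underbrace {0\cdots 0} _{10^{10}}\quad {\text{(a one followed by }}10^{10}=10\,000\,000\,000{\text{ zeros)}}\\{\frac {10^{10^{10}}}{10^{80}}}&=10^{10^{10}-80}=10^{9\,999\,999\,920}\end{aligned}$$

So even this innermost building block of the bound exceeds the assumed atom count by a factor whose own decimal representation has nearly ten billion digits.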

A much more efficient algorithm than Hashiguchi's procedure was devised by Kirsten in 2005.^[6] This algorithm runs, for a given nondeterministic finite automaton as input, within double-exponential space. Yet the resource requirements of this algorithm still greatly exceed the margins of what is considered practically feasible.

This algorithm was optimized and generalized to trees by Colcombet and Löding in 2008,^[7] as part of the theory of regular cost functions. It was implemented in 2017 in the tool suite Stamina.^[8]

Notes

References

Brzozowski, Janusz A. (1980). "Open problems about regular languages". In Book, Ronald V. (ed.). Formal language theory—Perspectives and open problems. New York: Academic Press. pp. 23–47. ISBN 978-0-12-115350-2.
Colcombet, Thomas; Löding, Christof (2008). "The Nesting-Depth of Disjunctive μ-Calculus for Tree Languages and the Limitedness Problem". Computer Science Logic. Lecture Notes in Computer Science. Vol. 5213. pp. 416–430. doi:10.1007/978-3-540-87531-4_30. ISBN 978-3-540-87530-7. ISSN 0302-9743.
Dejean, Françoise; Schützenberger, Marcel-Paul (1966). "On a Question of Eggan". Information and Control. 9 (1): 23–25. doi:10.1016/S0019-9958(66)90083-0.
Eggan, Lawrence C. (1963). "Transition graphs and the star-height of regular events". Michigan Mathematical Journal. 10 (4): 385–397. doi:10.1307/mmj/1028998975. Zbl 0173.01504.
McNaughton, Robert (1967). "The Loop Complexity of Pure-Group Events". Information and Control. 11 (1–2): 167–176. doi:10.1016/S0019-9958(67)90481-0.
Salomaa, Arto (1981). Jewels of Formal Language Theory. Melbourne: Pitman Publishing. ISBN 978-0-273-08522-5. Zbl 0487.68063.
