Star height

In theoretical computer science, more precisely in the theory of formal languages, the star height is a measure for the structural complexity of regular expressions and regular languages. The star height of a regular expression equals the maximum nesting depth of stars appearing in that expression. The star height of a regular language is the least star height of any regular expression for that language. The concept of star height was first defined and studied by Eggan (1963).

Formal definition

More formally, the star height h(E) of a regular expression E over a finite alphabet A is inductively defined as follows:

h(∅) = 0, h(ε) = 0, and h(a) = 0 for every alphabet symbol a in A;
h(EF) = h(E | F) = max(h(E), h(F));
h(E*) = h(E) + 1.

Here, ∅ is the special regular expression denoting the empty set and ε the special one denoting the empty word; E and F are arbitrary regular expressions.

The star height h(L) of a regular language L is defined as the minimum star height among all regular expressions representing L. The intuition here is that a language L of large star height is in some sense inherently complex, since it cannot be described by an "easy" regular expression of low star height.
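The inductive definition of the star height of an expression translates almost directly into a recursive function. The following is a minimal sketch, assuming a small syntax-tree representation; the class names (Empty, Epsilon, Symbol, Concat, Alt, Star) are illustrative and not taken from any library. Note that it computes the star height of an expression, not of a language.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Empty:
    """The expression denoting the empty set."""


@dataclass(frozen=True)
class Epsilon:
    """The expression denoting the empty word."""


@dataclass(frozen=True)
class Symbol:
    name: str


@dataclass(frozen=True)
class Concat:
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Alt:
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Star:
    inner: "Regex"


Regex = Union[Empty, Epsilon, Symbol, Concat, Alt, Star]


def star_height(e: Regex) -> int:
    """Maximum nesting depth of Star nodes in the expression tree e."""
    if isinstance(e, (Empty, Epsilon, Symbol)):
        return 0  # base cases: no star at all
    if isinstance(e, (Concat, Alt)):
        return max(star_height(e.left), star_height(e.right))
    if isinstance(e, Star):
        return 1 + star_height(e.inner)  # one more level of nesting
    raise TypeError(f"not a regular expression node: {e!r}")


# Usage: a star directly inside another star gives height 2.
a = Symbol("a")
nested = Star(Concat(a, Star(a)))  # (aa*)*
print(star_height(nested))
```

The function runs in time linear in the size of the expression tree, which is why computing the star height of a given expression is considered easy.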

Examples

While computing the star height of a regular expression is easy, determining the star height of a language can sometimes be tricky. For illustration, the regular expression

(b | aa*b)* aa*

over the alphabet A = {a,b} has star height 2. However, the described language is just the set of all words ending in an a; thus the language can also be described by the expression

(a | b)* a

which is only of star height 1. To prove that this language indeed has star height 1, one still needs to rule out that it could be described by a regular expression of lower star height. For our example, this can be done by an indirect proof: one proves that a language of star height 0 contains only finitely many words. Since the language under consideration is infinite, it cannot be of star height 0.
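The claim that an expression of star height 2 and one of star height 1 describe the same language can be spot-checked by brute force on short words. The sketch below assumes the two expressions are (b|aa*b)*aa* and (a|b)*a, written in Python's re syntax; the helper names words and same_language_up_to are hypothetical. Agreement on all words up to a fixed length is evidence only, not a proof of equivalence.

```python
import re
from itertools import product


def words(alphabet: str, max_len: int):
    """Yield every word over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def same_language_up_to(pattern_1: str, pattern_2: str,
                        alphabet: str = "ab", max_len: int = 8) -> bool:
    """True if both patterns accept exactly the same words up to max_len."""
    r1, r2 = re.compile(pattern_1), re.compile(pattern_2)
    return all(
        (r1.fullmatch(w) is None) == (r2.fullmatch(w) is None)
        for w in words(alphabet, max_len)
    )


# Assumed expressions: star height 2 versus star height 1.
print(same_language_up_to(r"(b|aa*b)*aa*", r"(a|b)*a"))
```

A full equivalence proof would instead compare the minimal deterministic automata of the two expressions.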

The star height of a group language is computable: for example, the star height of the language over {a,b} in which the numbers of occurrences of a and b are congruent modulo 2^n is n. [1]

Eggan's theorem

Example automaton of cycle rank 1. Kleene's algorithm transforms it into the regular expression a*b*ba ((a|b)b*a|ε)* (a|b)b* | a*b*b, which has star height 2. By Eggan's theorem, an equivalent regular expression of star height ≤ 1 must exist. In fact, a*b(b|a(a|b))* describes the same language.

In his seminal study of the star height of regular languages, Eggan (1963) established a relation between the theories of regular expressions, finite automata, and of directed graphs. In subsequent years, this relation became known as Eggan's theorem, cf. Sakarovitch (2009). We recall a few concepts from graph theory and automata theory.

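In graph theory, the cycle rank r(G) of a directed graph (digraph) G = (V, E) is inductively defined as follows:

If G is acyclic, then r(G) = 0. This applies in particular if G is empty.
If G is strongly connected and E is nonempty, then r(G) = 1 + min { r(G − v) : v ∈ V }, where G − v is the digraph resulting from deletion of vertex v and all edges beginning or ending at v.
If G is not strongly connected, then r(G) equals the maximum cycle rank among all strongly connected components of G.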
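The inductive definition of cycle rank can be followed literally in code. The sketch below is a naive, exponential-time illustration, assuming a digraph given as a set of vertices and a set of (source, target) pairs; the function names are hypothetical. It splits the graph into strongly connected components, and for each component that contains at least one edge it tries every vertex deletion.

```python
from typing import Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def reachable(start: Hashable, vertices: Set[Hashable], edges: Set[Edge]) -> Set[Hashable]:
    """Vertices reachable from `start` without leaving `vertices`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for x, y in edges:
            if x == u and y in vertices and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def strongly_connected_components(vertices: Set[Hashable], edges: Set[Edge]) -> List[Set[Hashable]]:
    """Simple forward/backward reachability decomposition (not optimized)."""
    reverse = {(y, x) for x, y in edges}
    remaining, components = set(vertices), []
    while remaining:
        v = next(iter(remaining))
        comp = reachable(v, remaining, edges) & reachable(v, remaining, reverse)
        components.append(comp)
        remaining -= comp
    return components


def cycle_rank(vertices: Iterable[Hashable], edges: Iterable[Edge]) -> int:
    vertices = set(vertices)
    edges = {(x, y) for x, y in edges if x in vertices and y in vertices}
    best = 0  # acyclic graphs (including the empty graph) have rank 0
    for comp in strongly_connected_components(vertices, edges):
        comp_edges = {(x, y) for x, y in edges if x in comp and y in comp}
        if not comp_edges:
            continue  # single vertex without self-loop contributes 0
        # strongly connected with at least one edge: delete the best vertex
        rank = 1 + min(cycle_rank(comp - {v}, comp_edges) for v in comp)
        best = max(best, rank)  # maximum over all components
    return best


# Usage: a directed 3-cycle becomes acyclic after deleting any one vertex.
print(cycle_rank({1, 2, 3}, {(1, 2), (2, 3), (3, 1)}))
```

Handling components uniformly covers all three cases of the definition: an acyclic graph has only trivial components, and a strongly connected graph is its own single component.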

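In automata theory, a nondeterministic finite automaton with ε-transitions (ε-NFA) is defined as a 5-tuple, (Q, Σ, δ, q0, F), consisting of

a finite set of states Q,
a finite set of input symbols Σ,
a set of labeled edges δ ⊆ Q × (Σ ∪ {ε}) × Q, referred to as the transition relation, where ε denotes the empty word,
an initial state q0 ∈ Q,
a set of states F ⊆ Q distinguished as accepting states.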

A word w ∈ Σ* is accepted by the ε-NFA if there exists a directed path from the initial state q0 to some final state in F using edges from δ, such that the concatenation of all labels visited along the path yields the word w. The set of all words over Σ accepted by the automaton is the language accepted by the automaton.
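The acceptance condition can be checked without enumerating paths by tracking the set of states reachable after each input symbol, closing under ε-transitions at every step. The following is a minimal sketch, assuming the transition relation is a set of (source, label, target) triples and that the empty string stands for ε; the names EPS, epsilon_closure and accepts, as well as the example automaton, are illustrative assumptions.

```python
from typing import Hashable, Iterable, Set, Tuple

EPS = ""  # assumed encoding of the label ε
Transition = Tuple[Hashable, str, Hashable]


def epsilon_closure(states: Iterable[Hashable], delta: Set[Transition]) -> Set[Hashable]:
    """All states reachable from `states` using only ε-labeled edges."""
    closure = set(states)
    stack = list(closure)
    while stack:
        q = stack.pop()
        for p, label, r in delta:
            if p == q and label == EPS and r not in closure:
                closure.add(r)
                stack.append(r)
    return closure


def accepts(delta: Set[Transition], q0: Hashable, final: Set[Hashable], word: str) -> bool:
    """True if some path from q0 to a final state spells out `word`."""
    current = epsilon_closure({q0}, delta)
    for symbol in word:
        step = {r for p, label, r in delta if p in current and label == symbol}
        current = epsilon_closure(step, delta)
    return bool(current & final)


# Hypothetical ε-NFA for the words over {a, b} that end in a.
delta = {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, EPS, 2)}
print(accepts(delta, 0, {2}, "ba"))
```

The digraph referred to in Eggan's theorem is obtained from such a transition set by dropping the labels and keeping the (source, target) pairs.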

When speaking of digraph properties of a nondeterministic finite automaton A with state set Q, we naturally address the digraph with vertex set Q induced by its transition relation. Now the theorem is stated as follows.

Eggan's Theorem: The star height of a regular language L equals the minimum cycle rank among all nondeterministic finite automata with ε-transitions accepting L.

Proofs of this theorem are given by Eggan (1963), and more recently by Sakarovitch (2009).

Generalized star height

The above definition assumes that regular expressions are built from the elements of the alphabet A using only the standard operators set union, concatenation, and Kleene star. Generalized regular expressions are defined just as regular expressions, but here also the set complement operator is allowed (the complement is always taken with respect to the set of all words over A). If we alter the definition such that taking complements does not increase the star height, that is,

h(E^c) = h(E),

where E^c denotes the complement of E, we can define the generalized star height of a regular language L as the minimum star height among all generalized regular expressions representing L. It is an open problem whether some languages can only be expressed with a generalized star height greater than one: this is the generalized star-height problem.

Note that, whereas it is immediate that a language of (ordinary) star height 0 can contain only finitely many words, there exist infinite languages having generalized star height 0. For instance, the regular expression

(a | b)* a

which we saw in the example above, can be equivalently described by the generalized regular expression

∅^c a,

since the complement of the empty set is precisely the set of all words over A. Thus the set of all words over the alphabet A ending in the letter a has star height one, while its generalized star height equals zero.
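That a star-free generalized expression can describe an infinite language can be illustrated by evaluating expressions on all words up to a fixed length, with complement taken relative to that bounded universe. The sketch below assumes a nested-tuple encoding of expressions ("empty", "eps", "sym", "alt", "cat", "not"); the encoding and function names are hypothetical. The encoding deliberately has no star operator, so every expression it can represent has generalized star height 0.

```python
from itertools import product
from typing import Set, Tuple


def all_words(alphabet: str, max_len: int) -> Set[str]:
    """Every word over `alphabet` of length 0..max_len."""
    return {"".join(t) for n in range(max_len + 1)
            for t in product(alphabet, repeat=n)}


def evaluate(expr: Tuple, alphabet: str, max_len: int) -> Set[str]:
    """Words of length <= max_len in the language denoted by expr."""
    op = expr[0]
    if op == "empty":
        return set()
    if op == "eps":
        return {""}
    if op == "sym":
        return {expr[1]} if max_len >= 1 else set()
    if op == "alt":
        return evaluate(expr[1], alphabet, max_len) | evaluate(expr[2], alphabet, max_len)
    if op == "cat":
        left = evaluate(expr[1], alphabet, max_len)
        right = evaluate(expr[2], alphabet, max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if op == "not":
        # complement relative to all words over the alphabet (bounded here)
        return all_words(alphabet, max_len) - evaluate(expr[1], alphabet, max_len)
    raise ValueError(f"unknown operator: {op!r}")


# Complement of the empty set, followed by the letter a.
expr = ("cat", ("not", ("empty",)), ("sym", "a"))
ends_in_a = {w for w in all_words("ab", 6) if w.endswith("a")}
print(evaluate(expr, "ab", 6) == ends_in_a)
```

Bounding the word length keeps every set finite; the check therefore confirms the equality only up to that length.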

Languages of generalized star height zero are also called star-free languages. It can be shown that a language L is star-free if and only if its syntactic monoid is aperiodic (Schützenberger (1965)).

References

  1. Sakarovitch (2009), p. 342