Kleene's algorithm

Last updated

In theoretical computer science, in particular in formal language theory, Kleene's algorithm transforms a given nondeterministic finite automaton (NFA) into a regular expression. Together with other conversion algorithms, it establishes the equivalence of several description formats for regular languages. Alternative presentations of the same method include the "elimination method" attributed to Brzozowski and McCluskey, the algorithm of McNaughton and Yamada, [1] and the use of Arden's lemma.

Contents

Algorithm description

According to Gross and Yellen (2004), [2] the algorithm can be traced back to Kleene (1956). [3] A presentation of the algorithm in the case of deterministic finite automata (DFAs) is given in Hopcroft and Ullman (1979). [4] The presentation of the algorithm for NFAs below follows Gross and Yellen (2004). [2]

Given a nondeterministic finite automaton M = (Q, Σ, δ, q0, F), with Q = { q0,...,qn } its set of states, the algorithm computes

the sets Rk
ij
of all strings that take M from state qi to qj without going through any state numbered higher than k.

Here, "going through a state" means entering and leaving it, so both i and j may be higher than k, but no intermediate state may. Each set Rk
ij
is represented by a regular expression; the algorithm computes them step by step for k = -1, 0, ..., n. Since there is no state numbered higher than n, the regular expression Rn
0j
represents the set of all strings that take M from its start state q0 to qj. If F = { q1,...,qf } is the set of accept states, the regular expression Rn
01
| ... | Rn
0f
represents the language accepted by M.

The initial regular expressions, for k = -1, are computed as follows for ij:

R−1
ij
= a1 | ... | am    where qj ∈ δ(qi,a1), ..., qj ∈ δ(qi,am)

and as follows for i=j:

R−1
ii
= a1 | ... | am | ε     where qi ∈ δ(qi,a1), ..., qi ∈ δ(qi,am)

In other words, R−1
ij
mentions all letters that label a transition from i to j, and we also include ε in the case where i=j.

After that, in each step the expressions Rk
ij
are computed from the previous ones by

Rk
ij
= Rk-1
ik
(Rk-1
kk
)*Rk-1
kj
| Rk-1
ij

Another way to understand the operation of the algorithm is as an "elimination method", where the states from 0 to n are successively removed: when state k is removed, the regular expression Rk-1
ij
, which describes the words that label a path from state i>k to state j>k, is rewritten into Rk
ij
so as to take into account the possibility of going via the "eliminated" state k.

By induction on k, it can be shown that the length [5] of each expression Rk
ij
is at most 1/3(4k+1(6s+7) - 4) symbols, where s denotes the number of characters in Σ. Therefore, the length of the regular expression representing the language accepted by M is at most 1/3(4n+1(6s+7)f - f - 3) symbols, where f denotes the number of final states. This exponential blowup is inevitable, because there exist families of DFAs for which any equivalent regular expression must be of exponential size. [6]

In practice, the size of the regular expression obtained by running the algorithm can be very different depending on the order in which the states are considered by the procedure, i.e., the order in which they are numbered from 0 to n.

Example

Example DFA given to Kleene's algorithm Deterministicfiniteautomaton.svg
Example DFA given to Kleene's algorithm

The automaton shown in the picture can be described as M = (Q, Σ, δ, q0, F) with

Kleene's algorithm computes the initial regular expressions as

R−1
00
  
= a | ε
R−1
01
= b
R−1
02
= ∅
R−1
10
= ∅
R−1
11
= b | ε
R−1
12
= a
R−1
20
= ∅
R−1
21
= a | b
R−1
22
= ε

After that, the Rk
ij
are computed from the Rk-1
ij
step by step for k = 0, 1, 2. Kleene algebra equalities are used to simplify the regular expressions as much as possible.

Step 0
R0
00
  
= R−1
00
(R−1
00
)*R−1
00
| R−1
00
  
= (a | ε)(a | ε)*(a | ε)| a | ε   = a*
R0
01
= R−1
00
(R−1
00
)*R−1
01
| R−1
01
= (a | ε)(a | ε)*b| b= a*b
R0
02
= R−1
00
(R−1
00
)*R−1
02
| R−1
02
= (a | ε)(a | ε)*| ∅= ∅
R0
10
= R−1
10
(R−1
00
)*R−1
00
| R−1
10
= ∅(a | ε)*(a | ε)| ∅= ∅
R0
11
= R−1
10
(R−1
00
)*R−1
01
| R−1
11
= ∅(a | ε)*b| b | ε= b | ε
R0
12
= R−1
10
(R−1
00
)*R−1
02
| R−1
12
= ∅(a | ε)*| a= a
R0
20
= R−1
20
(R−1
00
)*R−1
00
| R−1
20
= ∅(a | ε)*(a | ε)| ∅= ∅
R0
21
= R−1
20
(R−1
00
)*R−1
01
| R−1
21
= ∅(a | ε)*b| a | b= a | b
R0
22
= R−1
20
(R−1
00
)*R−1
02
| R−1
22
= ∅(a | ε)*| ε= ε
Step 1
R1
00
  
= R0
01
(R0
11
)*R0
10
| R0
00
  
= a*b(b | ε)*| a*    = a*
R1
01
= R0
01
(R0
11
)*R0
11
| R0
01
= a*b(b | ε)*(b | ε)| a*b= a*b*b
R1
02
= R0
01
(R0
11
)*R0
12
| R0
02
= a*b(b | ε)*a| ∅= a*b*ba
R1
10
= R0
11
(R0
11
)*R0
10
| R0
10
= (b | ε)(b | ε)*| ∅= ∅
R1
11
= R0
11
(R0
11
)*R0
11
| R0
11
= (b | ε)(b | ε)*(b | ε)| b | ε= b*
R1
12
= R0
11
(R0
11
)*R0
12
| R0
12
= (b | ε)(b | ε)*a| a= b*a
R1
20
= R0
21
(R0
11
)*R0
10
| R0
20
= (a | b)(b | ε)*| ∅= ∅
R1
21
= R0
21
(R0
11
)*R0
11
| R0
21
= (a | b)(b | ε)*(b | ε)| a | b= (a | b) b*
R1
22
= R0
21
(R0
11
)*R0
12
| R0
22
= (a | b)(b | ε)*a| ε= (a | b) b*a | ε
Step 2
R2
00
  
= R1
02
(R1
22
)*R1
20
| R1
00
  
= a*b*ba((a|b)b*a | ε)*| a*= a*
R2
01
= R1
02
(R1
22
)*R1
21
| R1
01
= a*b*ba((a|b)b*a | ε)*(a|b)b*| a*b*b= a*b (a (a | b) | b)*
R2
02
= R1
02
(R1
22
)*R1
22
| R1
02
= a*b*ba((a|b)b*a | ε)*((a|b)b*a | ε)| a*b*ba= a*b*b (a (a | b) b*)*a
R2
10
= R1
12
(R1
22
)*R1
20
| R1
10
= b*a((a|b)b*a | ε)*| ∅= ∅
R2
11
= R1
12
(R1
22
)*R1
21
| R1
11
= b*a((a|b)b*a | ε)*(a|b)b*| b*= (a (a | b) | b)*
R2
12
= R1
12
(R1
22
)*R1
22
| R1
12
= b*a((a|b)b*a | ε)*((a|b)b*a | ε)| b*a= (a (a | b) | b)*a
R2
20
= R1
22
(R1
22
)*R1
20
| R1
20
= ((a|b)b*a | ε)((a|b)b*a | ε)*| ∅= ∅
R2
21
= R1
22
(R1
22
)*R1
21
| R1
21
= ((a|b)b*a | ε)((a|b)b*a | ε)*(a|b)b*| (a | b) b*= (a | b) (a (a | b) | b)*
R2
22
= R1
22
(R1
22
)*R1
22
| R1
22
= ((a|b)b*a | ε)((a|b)b*a | ε)*((a|b)b*a | ε)| (a | b) b*a | ε   = ((a | b) b*a)*

Since q0 is the start state and q1 is the only accept state, the regular expression R2
01
denotes the set of all strings accepted by the automaton.

See also

Related Research Articles

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

<span class="mw-page-title-main">Automata theory</span> Study of abstract machines and automata

Automata theory is the study of abstract machines and automata, as well as the computational problems that can be solved using them. It is a theory in theoretical computer science with close connections to mathematical logic. The word automata comes from the Greek word αὐτόματος, which means "self-acting, self-willed, self-moving". An automaton is an abstract self-propelled computing device which follows a predetermined sequence of operations automatically. An automaton with a finite number of states is called a Finite Automaton (FA) or Finite-State Machine (FSM). The figure on the right illustrates a finite-state machine, which is a well-known type of automaton. This automaton consists of states and transitions. As the automaton sees a symbol of input, it makes a transition to another state, according to its transition function, which takes the previous state and current input symbol as its arguments.

In mathematics, a Kleene algebra is an idempotent semiring endowed with a closure operator. It generalizes the operations known from regular expressions.

<span class="mw-page-title-main">Büchi automaton</span>

In computer science and automata theory, a deterministic Büchi automaton is a theoretical machine which either accepts or rejects infinite inputs. Such a machine has a set of states and a transition function, which determines which state the machine should move to from its current state when it reads the next input character. Some states are accepting states and one state is the start state. The machine accepts an input if and only if it will pass through an accepting state infinitely many times as it reads the input.

<span class="mw-page-title-main">Deterministic finite automaton</span> Finite-state machine

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. Deterministic refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943.

In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if

In theoretical computer science, more precisely in the theory of formal languages, the star height is a measure for the structural complexity of regular expressions and regular languages. The star height of a regular expression equals the maximum nesting depth of stars appearing in that expression. The star height of a regular language is the least star height of any regular expression for that language. The concept of star height was first defined and studied by Eggan (1963).

A finite-state transducer (FST) is a finite-state machine with two memory tapes, following the terminology for Turing machines: an input tape and an output tape. This contrasts with an ordinary finite-state automaton, which has a single tape. An FST is a type of finite-state automaton (FSA) that maps between two sets of symbols. An FST is more general than an FSA. An FSA defines a formal language by defining a set of accepted strings, while an FST defines relations between sets of strings.

In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) which recognizes the same formal language. It is important in theory because it establishes that NFAs, despite their additional flexibility, are unable to recognize any language that cannot be recognized by some DFA. It is also important in practice for converting easier-to-construct NFAs into more efficiently executable DFAs. However, if the NFA has n states, the resulting DFA may have up to 2n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs.

In automata theory, a permutation automaton, or pure-group automaton, is a deterministic finite automaton such that each input symbol permutes the set of states.

The ω-regular languages are a class of ω-languages that generalize the definition of regular languages to infinite words.

In quantum computing, quantum finite automata (QFA) or quantum state machines are a quantum analog of probabilistic automata or a Markov decision process. They provide a mathematical abstraction of real-world quantum computers. Several types of automata may be defined, including measure-once and measure-many automata. Quantum finite automata can also be understood as the quantization of subshifts of finite type, or as a quantization of Markov chains. QFAs are, in turn, special cases of geometric finite automata or topological finite automata.

In mathematics and computer science, the probabilistic automaton (PA) is a generalization of the nondeterministic finite automaton; it includes the probability of a given transition into the transition function, turning it into a transition matrix. Thus, the probabilistic automaton also generalizes the concepts of a Markov chain and of a subshift of finite type. The languages recognized by probabilistic automata are called stochastic languages; these include the regular languages as a subset. The number of stochastic languages is uncountable.

In graph theory, the cycle rank of a directed graph is a digraph connectivity measure proposed first by Eggan and Büchi. Intuitively, this concept measures how close a digraph is to a directed acyclic graph (DAG), in the sense that a DAG has cycle rank zero, while a complete digraph of order n with a self-loop at each vertex has cycle rank n. The cycle rank of a directed graph is closely related to the tree-depth of an undirected graph and to the star height of a regular language. It has also found use in sparse matrix computations and logic (Rossman 2008).

In automata theory, a branch of theoretical computer science, an ω-automaton is a variation of finite automata that runs on infinite, rather than finite, strings as input. Since ω-automata do not stop, they have a variety of acceptance conditions rather than simply a set of accepting states.

In automata theory, a semi-deterministic Büchi automaton is a special type of Büchi automaton. In such an automaton, the set of states can be partitioned into two subsets: one subset forms a deterministic automaton and also contains all the accepting states.

In automata theory, complementation of a Büchi automaton is construction of another Büchi automaton that recognizes the complement of the ω-regular language recognized by the given Büchi automaton. Existence of algorithms for this construction proves that the set of ω-regular languages is closed under complementation.

In computer science, Thompson's construction algorithm, also called the McNaughton–Yamada–Thompson algorithm, is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). This NFA can be used to match strings against the regular expression. This algorithm is credited to Ken Thompson.

In computational learning theory, induction of regular languages refers to the task of learning a formal description of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way, approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction.

A tree stack automaton is a formalism considered in automata theory. It is a finite state automaton with the additional ability to manipulate a tree-shaped stack. It is an automaton with storage whose storage roughly resembles the configurations of a thread automaton. A restricted class of tree stack automata recognises exactly the languages generated by multiple context-free grammars.

References

  1. McNaughton, R.; Yamada, H. (March 1960). "Regular Expressions and State Graphs for Automata". IRE Transactions on Electronic Computers. EC-9 (1): 39–47. doi:10.1109/TEC.1960.5221603. ISSN   0367-9950.
  2. 1 2 Jonathan L. Gross and Jay Yellen, ed. (2004). Handbook of Graph Theory. Discrete Mathematics and it Applications. CRC Press. ISBN   1-58488-090-2. Here: sect.2.1, remark R13 on p.65
  3. Kleene, Stephen C. (1956). "Representation of Events in Nerve Nets and Finite Automata" (PDF). Automata Studies, Annals of Math. Studies. Princeton Univ. Press. 34. Here: sect.9, p.37-40
  4. John E. Hopcroft, Jeffrey D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation . Addison-Wesley. ISBN   0-201-02988-X. Here: Section 3.2.1 pages 91-96
  5. More precisely, the number of regular-expression symbols, "ai", "ε", "|", "*", "·"; not counting parentheses.
  6. Gruber, Hermann; Holzer, Markus (2008). Aceto, Luca; Damgård, Ivan; Goldberg, Leslie Ann; Halldórsson, Magnús M.; Ingólfsdóttir, Anna; Walukiewicz, Igor (eds.). "Finite Automata, Digraph Connectivity, and Regular Expression Size". Automata, Languages and Programming. Lecture Notes in Computer Science. Springer Berlin Heidelberg. 5126: 39–50. doi:10.1007/978-3-540-70583-3_4. ISBN   9783540705833. S2CID   10975422.. Theorem 16.