# Thompson's construction

Last updated

In computer science, Thompson's construction algorithm, also called the McNaughton-Yamada-Thompson algorithm [1] , is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). [2] This NFA can be used to match strings against the regular expression. This algorithm is credited to Ken Thompson.

## Contents

Regular expressions and nondeterministic finite automata are two representations of formal languages. For instance, text processing utilities use regular expressions to describe advanced search patterns, but NFAs are better suited for execution on a computer. Hence, this algorithm is of practical interest, since it can compile regular expressions into NFAs. From a theoretical point of view, this algorithm is a part of the proof that they both accept exactly the same languages, that is, the regular languages.

An NFA can be made deterministic by the powerset construction and then be minimized to get an optimal automaton corresponding to the given regular expression. However, an NFA may also be interpreted directly.

To decide whether two given regular expressions describe the same language, each can be converted into an equivalent minimal deterministic finite automaton via Thompson's construction, powerset construction, and DFA minimization. If, and only if, the resulting automata agree up to renaming of states, the regular expressions' languages agree.

## The algorithm

The algorithm works recursively by splitting an expression into its constituent subexpressions, from which the NFA will be constructed using a set of rules. [3] More precisely, from a regular expression E, the obtained automaton A with the transition function δ respects the following properties:

• A has exactly one initial state q0, which is not accessible from any other state. That is, for any state q and any letter a, ${\displaystyle \delta (q,a)}$ does not contain q0.
• A has exactly one final state qf, which is not co-accessible from any other state. That is, for any letter a, ${\displaystyle \delta (q_{f},a)=\emptyset }$.
• Let c be the number of concatenation of the regular expression E and let s be the number of symbols apart from parentheses — that is, |, *, a and ε. Then, the number of states of A is 2sc (linear in the size of E).
• The number of transitions leaving any state is at most two.
• Since an NFA of m states and at most e transitions from each state can match a string of length n in time O(emn), a Thompson NFA can do pattern matching in linear time, assuming a fixed-size alphabet. [4] [ better source needed ]

### Rules

The following rules are depicted according to Aho et al. (2007), [1] p. 122. In what follows, N(s) and N(t) are the NFA of the subexpressions s and t, respectively.

The empty-expression ε is converted to

A symbola of the input alphabet is converted to

The union expressions|t is converted to

State q goes via ε either to the initial state of N(s) or N(t). Their final states become intermediate states of the whole NFA and merge via two ε-transitions into the final state of the NFA.

The concatenation expressionst is converted to

The initial state of N(s) is the initial state of the whole NFA. The final state of N(s) becomes the initial state of N(t). The final state of N(t) is the final state of the whole NFA.

The Kleene star expressions* is converted to

An ε-transition connects initial and final state of the NFA with the sub-NFA N(s) in between. Another ε-transition from the inner final to the inner initial state of N(s) allows for repetition of expression s according to the star operator.

• The parenthesized expression (s) is converted to N(s) itself.

With these rules, using the empty expression and symbol rules as base cases, it is possible to prove with mathematical induction that any regular expression may be converted into an equivalent NFA. [1]

## Example

Two examples are now given, a small informal one with the result, and a bigger with a step by step application of the algorithm.

### Small Example

The picture below shows the result of Thompson's construction on (ε|a*b). The pink oval corresponds to a, the teal oval corresponds to a*, the green oval corresponds to b, the orange oval corresponds to a*b, and the blue oval corresponds to ε.

### Application of the algorithm

As an example, the picture shows the result of Thompson's construction algorithm on the regular expression (0|(1(01*(00)*0)*1)*)* that denotes the set of binary numbers that are multiples of 3:

{ ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", ... }.

The upper right part shows the logical structure (syntax tree) of the expression, with "." denoting concatenation (assumed to have variable arity); subexpressions are named a-q for reference purposes. The left part shows the nondeterministic finite automaton resulting from Thompson's algorithm, with the entry and exit state of each subexpression colored in magenta and cyan, respectively. An ε as transition label is omitted for clarity unlabelled transitions are in fact ε transitions. The entry and exit state corresponding to the root expression q is the start and accept state of the automaton, respectively.

The algorithm's steps are as follows:

 q: start converting Kleene star expression (0|(1(01*(00)*0)*1)*)* b: start converting union expression 0|(1(01*(00)*0)*1)* a: convert symbol 0 p: start converting Kleene star expression (1(01*(00)*0)*1)* d: start converting concatenation expression 1(01*(00)*0)*1 c: convert symbol 1 n: start converting Kleene star expression (01*(00)*0)* f: start converting concatenation expression 01*(00)*0 e: convert symbol 0 h: start converting Kleene star expression 1* g: convert symbol 1 h: finished converting Kleene star expression 1* l: start converting Kleene star expression (00)* j: start converting concatenation expression 00 i: convert symbol 0 k: convert symbol 0 j: finished converting concatenation expression 00 l: finished converting Kleene star expression (00)* m: convert symbol 0 f: finished converting concatenation expression 01*(00)*0 n: finished converting Kleene star expression (01*(00)*0)* o: convert symbol 1 d: finished converting concatenation expression 1(01*(00)*0)*1 p: finished converting Kleene star expression (1(01*(00)*0)*1)* b: finished converting union expression 0|(1(01*(00)*0)*1)* q: finished converting Kleene star expression (0|(1(01*(00)*0)*1)*)*

An equivalent minimal deterministic automaton is shown below.

## Relation to other algorithms

Thompson's is one of several algorithms for constructing NFAs from regular expressions; [5] an earlier algorithm was given by McNaughton and Yamada. [6] Converse to Thompson's construction, Kleene's algorithm transforms a finite automaton into a regular expression.

Glushkov's construction algorithm is similar to Thompson's construction, once the ε-transitions are removed.

## Related Research Articles

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

In theoretical computer science and formal language theory, a regular language is a formal language that can be expressed using a regular expression, in the strict sense of the latter notion used in theoretical computer science.

In computer science and automata theory, a Büchi automaton is a type of ω-automaton, which extends a finite automaton to infinite inputs. It accepts an infinite input sequence if there exists a run of the automaton that visits one of the final states infinitely often. Büchi automata recognize the omega-regular languages, the infinite word version of regular languages. It is named after the Swiss mathematician Julius Richard Büchi who invented this kind of automaton in 1962.

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. Deterministic refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943.

In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if

In the theory of computation, a generalized nondeterministic finite automaton (GNFA), also known as an expression automaton or a generalized nondeterministic finite state machine, is a variation of a nondeterministic finite automaton (NFA) where each transition is labeled with any regular expression. The GNFA reads blocks of symbols from the input which constitute a string as defined by the regular expression on the transition. There are several differences between a standard finite state machine and a generalized nondeterministic finite state machine. A GNFA must have only one start state and one accept state, and these cannot be the same state, whereas a NFA or DFA both may have several accept states, and the start state can be an accept state. A GNFA must have only one transition between any two states, whereas a NFA or DFA both allow for numerous transitions between states. In a GNFA, a state has a single transition to every state in the machine, although often it is a convention to ignore the transitions that are labelled with the empty set when drawing generalized nondeterministic finite state machines.

In theoretical computer science, more precisely in the theory of formal languages, the star height is a measure for the structural complexity of regular expressions and regular languages. The star height of a regular expression equals the maximum nesting depth of stars appearing in that expression. The star height of a regular language is the least star height of any regular expression for that language. The concept of star height was first defined and studied by Eggan (1963).

In automata theory, an alternating finite automaton (AFA) is a nondeterministic finite automaton whose transitions are divided into existential and universal transitions. For example, let A be an alternating automaton.

A finite-state transducer (FST) is a finite-state machine with two memory tapes, following the terminology for Turing machines: an input tape and an output tape. This contrasts with an ordinary finite-state automaton, which has a single tape. An FST is a type of finite-state automaton that maps between two sets of symbols. An FST is more general than a finite-state automaton (FSA). An FSA defines a formal language by defining a set of accepted strings, while an FST defines relations between sets of strings.

In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) which recognizes the same formal language. It is important in theory because it establishes that NFAs, despite their additional flexibility, are unable to recognize any language that cannot be recognized by some DFA. It is also important in practice for converting easier-to-construct NFAs into more efficiently executable DFAs. However, if the NFA has n states, the resulting DFA may have up to 2n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs.

In computer science, in particular in automata theory, a two-way finite automaton is a finite automaton that is allowed to re-read its input.

In quantum computing, quantum finite automata (QFA) or quantum state machines are a quantum analog of probabilistic automata or a Markov decision process. They provide a mathematical abstraction of real-world quantum computers. Several types of automata may be defined, including measure-once and measure-many automata. Quantum finite automata can also be understood as the quantization of subshifts of finite type, or as a quantization of Markov chains. QFAs are, in turn, special cases of geometric finite automata or topological finite automata.

In automata theory, DFA minimization is the task of transforming a given deterministic finite automaton (DFA) into an equivalent DFA that has a minimum number of states. Here, two DFAs are called equivalent if they recognize the same regular language. Several different algorithms accomplishing this task are known and described in standard textbooks on automata theory.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

In graph theory, the cycle rank of a directed graph is a digraph connectivity measure proposed first by Eggan and Büchi. Intuitively, this concept measures how close a digraph is to a directed acyclic graph (DAG), in the sense that a DAG has cycle rank zero, while a complete digraph of order n with a self-loop at each vertex has cycle rank n. The cycle rank of a directed graph is closely related to the tree-depth of an undirected graph and to the star height of a regular language. It has also found use in sparse matrix computations and logic (Rossman 2008).

In automata theory, a branch of theoretical computer science, an ω-automaton is a variation of finite automatons that runs on infinite, rather than finite, strings as input. Since ω-automata do not stop, they have a variety of acceptance conditions rather than simply a set of accepting states.

The regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression that takes a very long time to evaluate. The attack exploits the fact that most regular expression implementations have exponential time worst case complexity: the time taken can grow exponentially in relation to input size. An attacker can thus cause a program to spend an unbounded amount of time processing by providing such a regular expression, either slowing down or becoming unresponsive.

In theoretical computer science, in particular in formal language theory, Kleene's algorithm transforms a given nondeterministic finite automaton (NFA) into a regular expression. Together with other conversion algorithms, it establishes the equivalence of several description formats for regular languages. Alternative presentations of the same method include the "elimination method" attributed to Brzozowski and McCluskey, the algorithm of McNaughton and Yamada, and the use of Arden's lemma.

In computer science theory – particularly formal language theory – the Glushkov Construction Algorithm, invented by Victor Mikhailovich Glushkov, transforms a given regular expression into an equivalent nondeterministic finite automaton (NFA). Thus, it forms a bridge between regular expressions and nondeterministic finite automata: two abstract representations of the same class of formal languages.

In automata theory, an unambiguous finite automaton (UFA) is a nondeterministic finite automaton (NFA) such that each word has at most one accepting path. Each deterministic finite automaton (DFA) is an UFA, but not vice versa. DFA, UFA, and NFA recognize exactly the same class of formal languages. On the one hand, an NFA can be exponentially smaller than an equivalent DFA. On the other hand, some problems are easily solved on DFAs and not on UFAs. For example, given an automaton A, an automaton A which accepts the complement of A can be computed in linear time when A is a DFA, it is not known whether it can be done in polynomial time for UFA. Hence UFAs are a mix of the worlds of DFA and of NFA; in some cases, they lead to smaller automata than DFA and quicker algorithms than NFA.

## References

1. Alfred Vaino Aho; Monica S. Lam; Ravi Sethi; Jeffrey D. Ullman (2007). "3.7.4 Construction of an NFA from a Regular Expression" (print). (2nd ed.). Boston, MA, USA: Pearson Addison-Wesley. p.  159–163. ISBN   9780321486813.
2. Louden, Kenneth C. (1997). "2.4.1 From a Regular Expression to an NFA" (print). Compiler construction : Principles and Practice (3rd ed.). 20 Park Plaza Boston, MA 02116-4324, US: PWS Publishing Company. pp. 64–69. ISBN   978-0-534-93972-4.CS1 maint: location (link)
3. Ken Thompson (Jun 1968). "Programming Techniques: Regular expression search algorithm". Communications of the ACM. 11 (6): 419–422. doi:10.1145/363347.363387.
4. Xing, Guangming. "Minimized Thompson NFA" (PDF).
5. Watson, Bruce W. (1995). A taxonomy of finite automata construction algorithms (PDF) (Technical report). Eindhoven University of Technology. Computing Science Report 93/43.
6. R. McNaughton, H. Yamada (Mar 1960). "Regular Expressions and State Graphs for Automata". IEEE Transactions on Electronic Computers. 9 (1): 39–47. doi:10.1109/TEC.1960.5221603.