Thompson's construction

Last updated

In computer science, Thompson's construction algorithm, also called the McNaughton-Yamada-Thompson algorithm [1] , is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). [2] This NFA can be used to match strings against the regular expression. This algorithm is credited to Ken Thompson.

Computer science Study of the theoretical foundations of information and computation

Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. It enables the use of algorithms to manipulate, store, and communicate digital information. A computer scientist studies the theory of computation and the practice of designing software systems.

Algorithm An unambiguous specification of how to solve a class of problems

In mathematics and computer science, an algorithm is a set of instructions, typically to solve a class of problems or perform a computation. Algorithms are unambiguous specifications for performing calculation, data processing, automated reasoning, and other tasks.

Regular expression Sequence of characters that forms a search pattern

A regular expression, regex or regexp is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

Contents

Regular expressions and nondeterministic finite automata are two representations of formal languages. For instance, text processing utilities use regular expressions to describe advanced search patterns, but NFAs are better suited for execution on a computer. Hence, this algorithm is of practical interest, since it can compile regular expressions into NFAs. From a theoretical point of view, this algorithm is a part of the proof that they both accept exactly the same languages, that is, the regular languages.

Formal language set of strings of symbols that may be constrained by rules that are specific to it; words whose letters are taken from an alphabet and are well-formed according to a specific set of rules

In mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.

Text processing creating or manipulating electronic text

In computing, the term text processing refers to the theory and practice of automating the creation or manipulation of electronic text. Text usually refers to all the alphanumeric characters specified on the keyboard of the person engaging the practice, but in general text means the abstraction layer immediately above the standard character encoding of the target text. The term processing refers to automated processing, as opposed to the same manipulation done manually.

A compiler is a computer program that translates computer code written in one programming language into another language. The name compiler is primarily used for programs that translate source code from a high-level programming language to a lower level language to create an executable program.

An NFA can be made deterministic by the powerset construction and then be minimized to get an optimal automaton corresponding to the given regular expression. However, an NFA may also be interpreted directly.

In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) which recognizes the same formal language. It is important in theory because it establishes that NFAs, despite their additional flexibility, are unable to recognize any language that cannot be recognized by some DFA. It is also important in practice for converting easier-to-construct NFAs into more efficiently executable DFAs. However, if the NFA has n states, the resulting DFA may have up to 2n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs.

DFA minimization

In automata theory, DFA minimization is the task of transforming a given deterministic finite automaton (DFA) into an equivalent DFA that has a minimum number of states. Here, two DFAs are called equivalent if they recognize the same regular language. Several different algorithms accomplishing this task are known and described in standard textbooks on automata theory.

To decide whether two given regular expressions describe the same language, each can be converted into an equivalent minimal deterministic finite automaton via Thompson's construction, powerset construction, and DFA minimization. If, and only if, the resulting automata agree up to renaming of states, the regular expressions' languages agreed.

Deterministic finite automaton finite-state machine that accepts and rejects strings of symbols and only produces a unique computation (or run) of the automaton for each input string

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite state machine (DFSM), or deterministic finite state automaton (DFSA)—is a finite-state machine that accepts or rejects strings of symbols and only produces a unique computation of the automaton for each input string. Deterministic refers to the uniqueness of the computation. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943.

Up to Mathematical statement of uniqueness, except for an equivalent . structure (equivalence relation)

In mathematics, the phrase up to appears in discussions about the elements of a set, and the conditions under which subsets of those elements may be considered equivalent. The statement "elements a and b of set S are equivalent up to X" means that a and b are equivalent if criterion X is ignored. That is, a and b can be transformed into one another if a transform corresponding to X is applied.

The algorithm

The algorithm works recursively by splitting an expression into its constituent subexpressions, from which the NFA will be constructed using a set of rules. [3] More precisely, from a regular expression E, the obtained automaton A with the transition function δ respects the following properties:

Rules

The following rules are depicted according to Aho et al. (2007), [1] p. 122. In what follows, N(s) and N(t) are the NFA of the subexpressions s and t, respectively.

The empty-expression ε is converted to

Thompson-epsilon.svg

A symbola of the input alphabet is converted to

Thompson-a-symbol.svg

The union expressions|t is converted to

Thompson-or.svg

State q goes via ε either to the initial state of N(s) or N(t). Their final states become intermediate states of the whole NFA and merge via two ε-transitions into the final state of the NFA.

The concatenation expressionst is converted to

Thompson-concat.svg

The initial state of N(s) is the initial state of the whole NFA. The final state of N(s) becomes the initial state of N(t). The final state of N(t) is the final state of the whole NFA.

The Kleene star expressions* is converted to

Thompson-kleene-star.svg

An ε-transition connects initial and final state of the NFA with the sub-NFA N(s) in between. Another ε-transition from the inner final to the inner initial state of N(s) allows for repetition of expression s according to the star operator.


With these rules, using the empty expression and symbol rules as base cases, it is possible to prove with mathematical induction that any regular expression may be converted into an equivalent NFA. [1]

Example

Two examples are now given, a small informal one with the result, and a bigger with a step by step application of the algorithm.

Small Example

Example of (e|a*b) using Thompson's construction, step by step Small-thompson-example.svg
Example of (ε|a*b) using Thompson's construction, step by step

The picture below shows the result of Thompson's construction on (ε|a*b). The pink oval corresponds to a, the teal oval corresponds to a*, the green oval corresponds to b, the orange oval corresponds to a*b, and the blue oval corresponds to ε.

Application of the algorithm

NFA obtained from regular expression (0|(1(01*(00)*0)*1)*)* Thompson's construction algorithm applied to regular expression for binary multiples of 3.gif
NFA obtained from regular expression (0|(1(01*(00)*0)*1)*)*

As an example, the picture shows the result of Thompson's construction algorithm on the regular expression (0|(1(01*(00)*0)*1)*)* that denotes the set of binary numbers that are multiples of 3:

{ ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", ... }.

The upper right part shows the logical structure (syntax tree) of the expression, with "." denoting concatenation (assumed to have variable arity); subexpressions are named a-q for reference purposes. The left part shows the nondeterministic finite automaton resulting from Thompson's algorithm, with the entry and exit state of each subexpression colored in magenta and cyan, respectively. An ε as transition label is omitted for clarity unlabelled transitions are in fact ε transitions. The entry and exit state corresponding to the root expression q is the start and accept state of the automaton, respectively.

The algorithm's steps are as follows:

q:start converting Kleene star expression(0|(1(01*(00)*0)*1)*)*
b:start converting union expression0|(1(01*(00)*0)*1)*
a:convert symbol0
p:start converting Kleene star expression(1(01*(00)*0)*1)*
d:start converting concatenation expression1(01*(00)*0)*1
c:convert symbol1
n:start converting Kleene star expression(01*(00)*0)*
f:start converting concatenation expression01*(00)*0
e:convert symbol0
h:start converting Kleene star expression1*
g:convert symbol1
h:finished converting Kleene star expression1*
l:start converting Kleene star expression(00)*
j:start converting concatenation expression00
i:convert symbol0
k:convert symbol0
j:finished converting concatenation expression00
l:finished converting Kleene star expression(00)*
m:convert symbol0
f:finished converting concatenation expression01*(00)*0
n:finished converting Kleene star expression(01*(00)*0)*
o:convert symbol1
d:finished converting concatenation expression1(01*(00)*0)*1
p:finished converting Kleene star expression(1(01*(00)*0)*1)*
b:finished converting union expression0|(1(01*(00)*0)*1)*
q:finished converting Kleene star expression(0|(1(01*(00)*0)*1)*)*

An equivalent minimal deterministic automaton is shown below.

DFA example multiplies of 3.svg

Relation to other algorithms

Thompson's is one of several algorithms for constructing NFAs from regular expressions; [5] an earlier algorithm was given by McNaughton and Yamada. [6] Converse to Thompson's construction, Kleene's algorithm transforms a finite automaton into a regular expression.

Glushkov's construction algorithm is similar to Thompson's construction, once the ε-transitions are removed.

Related Research Articles

Pushdown automaton

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

In theoretical computer science and formal language theory, a regular language is a formal language that can be expressed using a regular expression, in the strict sense of the latter notion used in theoretical computer science.

In computer science and automata theory, a Büchi automaton is a type of ω-automaton, which extends a finite automaton to infinite inputs. It accepts an infinite input sequence if there exists a run of the automaton that visits one of the final states infinitely often. Büchi automata recognize the omega-regular languages, the infinite word version of regular languages. It is named after the Swiss mathematician Julius Richard Büchi who invented this kind of automaton in 1962.

In automata theory, a finite state machine is called a deterministic finite automaton (DFA), if

In the theory of computation, a generalized nondeterministic finite automaton (GNFA), also known as an expression automaton or a generalized nondeterministic finite state machine, is a variation of a nondeterministic finite automaton (NFA) where each transition is labeled with any regular expression. The GNFA reads blocks of symbols from the input which constitute a string as defined by the regular expression on the transition. There are several differences between a standard finite state machine and a generalized nondeterministic finite state machine. A GNFA must have only one start state and one accept state, and these cannot be the same state, whereas a NFA or DFA both may have several accept states, and the start state can be an accept state. A GNFA must have only one transition between any two states, whereas a NFA or DFA both allow for numerous transitions between states. In a GNFA, a state has a single transition to every state in the machine, although often it is a convention to ignore the transitions that are labelled with the empty set when drawing generalized nondeterministic finite state machines.

In theoretical computer science, more precisely in the theory of formal languages, the star height is a measure for the structural complexity of regular expressions and regular languages. The star height of a regular expression equals the maximum nesting depth of stars appearing in that expression. The star height of a regular language is the least star height of any regular expression for that language. The concept of star height was first defined and studied by Eggan (1963).

In automata theory, an alternating finite automaton (AFA) is a nondeterministic finite automaton whose transitions are divided into existential and universal transitions. For example, let A be an alternating automaton.

A finite-state transducer (FST) is a finite-state machine with two memory tapes, following the terminology for Turing machines: an input tape and an output tape. This contrasts with an ordinary finite-state automaton, which has a single tape. An FST is a type of finite-state automaton that maps between two sets of symbols. An FST is more general than a finite-state automaton (FSA). An FSA defines a formal language by defining a set of accepted strings, while an FST defines relations between sets of strings.

In computer science, in particular in automata theory, a two-way finite automaton is a finite automaton that is allowed to re-read its input.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

In graph theory, the cycle rank of a directed graph is a digraph connectivity measure proposed first by Eggan and Büchi. Intuitively, this concept measures how close a digraph is to a directed acyclic graph (DAG), in the sense that a DAG has cycle rank zero, while a complete digraph of order n with a self-loop at each vertex has cycle rank n. The cycle rank of a directed graph is closely related to the tree-depth of an undirected graph and to the star height of a regular language. It has also found use in sparse matrix computations and logic (Rossman 2008).

In automata theory, a branch of theoretical computer science, an ω-automaton is a variation of finite automatons that runs on infinite, rather than finite, strings as input. Since ω-automata do not stop, they have a variety of acceptance conditions rather than simply a set of accepting states.

The regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression that takes a very long time to evaluate. The attack exploits the fact that most regular expression implementations have exponential time worst case complexity: the time taken can grow exponentially in relation to input size. An attacker can thus cause a program to spend an unbounded amount of time processing by providing such a regular expression, either slowing down or becoming unresponsive.

In theoretical computer science, in particular in formal language theory, Kleene's algorithm transforms a given nondeterministic finite automaton (NFA) into a regular expression. Together with other conversion algorithms, it establishes the equivalence of several description formats for regular languages. Alternative presentations of the same method include the "elimination method" attributed to Brzozowski and McCluskey, the algorithm of McNaughton and Yamada, and the use of Arden's lemma.

In computer science theory – particularly formal language theory – the Glushkov Construction Algorithm, invented by Victor Mikhailovich Glushkov, transforms a given regular expression into an equivalent nondeterministic finite automaton (NFA). Thus, it forms a bridge between regular expressions and nondeterministic finite automata: two abstract representations of the same class of formal languages.

In automata theory, an unambiguous finite automaton (UFA) is a special kind of a nondeterministic finite automaton (NFA). Each deterministic finite automaton (DFA) is an UFA, but not vice versa. DFA, UFA, and NFA recognize exactly the same class of formal languages. On the one hand, an NFA can be exponentially smaller than an equivalent DFA. On the other hand, some problems are easily solved on DFAs and not on UFAs. For example, given an automaton A, an automaton A' which accepts the complement of A can be computed in linear time when A is a DFA, it is not known whether it can be done in polynomial time for UFA. Hence UFAs are a mix of the worlds of DFA and of NFA; in some cases, they lead to smaller automata than DFA and quicker algorithms than NFA.

References

  1. 1 2 3 Alfred Vaino Aho; Monica S. Lam; Ravi Sethi; Jeffrey D. Ullman (2007). "3.7.4 Construction of an NFA from a Regular Expression" (print). Compilers : Principles, Techniques, & Tools (2nd ed.). Boston, MA, USA: Pearson Addison-Wesley. p. 159–163. ISBN   9780321486813.
  2. Louden, Kenneth C. (1997). "2.4.1 From a Regular Expression to an NFA" (print). Compiler construction : Principles and Practice (3rd ed.). 20 Park Plaza Boston, MA 02116-4324, US: PWS Publishing Company. pp. 64–69. ISBN   978-0-534-93972-4.
  3. Ken Thompson (Jun 1968). "Programming Techniques: Regular expression search algorithm". Communications of the ACM. 11 (6): 419–422. doi:10.1145/363347.363387.
  4. Xing, Guangming. "Minimized Thompson NFA" (PDF).
  5. Watson, Bruce W. (1995). A taxonomy of finite automata construction algorithms (PDF) (Technical report). Eindhoven University of Technology. Computing Science Report 93/43.
  6. R. McNaughton, H. Yamada (Mar 1960). "Regular Expressions and State Graphs for Automata". IEEE Transactions on Electronic Computers. 9 (1): 39–47. doi:10.1109/TEC.1960.5221603.