HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includi ...
, Thompson's construction
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
, also called the McNaughton–Yamada–Thompson algorithm, is a method of transforming a
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
into an equivalent
nondeterministic finite automaton In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if * each of its transitions is ''uniquely'' determined by its source state and input symbol, and * reading an input symbol is required for each state ...
(NFA). This NFA can be used to match strings against the regular expression. This algorithm is credited to
Ken Thompson Kenneth Lane Thompson (born February 4, 1943) is an American pioneer of computer science. Thompson worked at Bell Labs for most of his career where he designed and implemented the original Unix operating system. He also invented the B programmi ...
. Regular expressions and nondeterministic finite automata are two representations of
formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of sy ...
s. For instance,
text processing In computing, the term text processing refers to the theory and practice of automating the creation or manipulation of electronic text. ''Text'' usually refers to all the alphanumeric characters specified on the keyboard of the person engaging t ...
utilities use regular expressions to describe advanced search patterns, but NFAs are better suited for execution on a computer. Hence, this algorithm is of practical interest, since it can
compile In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
regular expressions into NFAs. From a theoretical point of view, this algorithm is a part of the proof that they both accept exactly the same languages, that is, the
regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s. An NFA can be made deterministic by the
powerset construction In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) which recognizes the sa ...
and then be minimized to get an optimal automaton corresponding to the given regular expression. However, an NFA may also be interpreted directly. To decide whether two given regular expressions describe the same language, each can be converted into an equivalent minimal
deterministic finite automaton In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state autom ...
via Thompson's construction,
powerset construction In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) which recognizes the sa ...
, and
DFA minimization In automata theory (a branch of theoretical computer science), DFA minimization is the task of transforming a given deterministic finite automaton (DFA) into an equivalent DFA that has a minimum number of states. Here, two DFAs are called equival ...
. If, and only if, the resulting automata agree up to renaming of states, the regular expressions' languages agree.


The algorithm

The algorithm works
recursively Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics ...
by splitting an expression into its constituent subexpressions, from which the NFA will be constructed using a set of rules. More precisely, from a regular expression , the obtained automaton with the transition function respects the following properties: * has exactly one initial state , which is not accessible from any other state. That is, for any state and any letter , \Delta(q,a) does not contain . * has exactly one final state , which is not co-accessible from any other state. That is, for any letter , \Delta(q_f,a)=\emptyset. * Let be the number of concatenation of the regular expression and let be the number of symbols apart from parentheses — that is, , , and . Then, the number of states of is (linear in the size of ). * The number of transitions leaving any state is at most two. * Since an NFA of states and at most transitions from each state can match a string of length in time , a Thompson NFA can do pattern matching in linear time, assuming a fixed-size alphabet.


Rules

The following rules are depicted according to Aho et al. (2007), p. 122. In what follows, ''N''() and ''N''() are the NFA of the subexpressions and , respectively. The empty-expression ε is converted to A symbol ''a'' of the input alphabet is converted to The union expression , is converted to State ''q'' goes via ε either to the initial state of ''N''() or ''N''(). Their final states become intermediate states of the whole NFA and merge via two ε-transitions into the final state of the NFA. The concatenation expression ''st'' is converted to The initial state of ''N''() is the initial state of the whole NFA. The final state of ''N''() becomes the initial state of ''N''(). The final state of ''N''() is the final state of the whole NFA. The
Kleene star In mathematical logic and computer science, the Kleene star (or Kleene operator or Kleene closure) is a unary operation, either on sets of strings or on sets of symbols or characters. In mathematics, it is more commonly known as the free monoid ...
expression * is converted to An ε-transition connects initial and final state of the NFA with the sub-NFA ''N''() in between. Another ε-transition from the inner final to the inner initial state of ''N''() allows for repetition of expression according to the star operator. * The parenthesized expression () is converted to ''N''() itself. With these rules, using the empty expression and symbol rules as base cases, it is possible to prove with
mathematical induction Mathematical induction is a method for proving that a statement ''P''(''n'') is true for every natural number ''n'', that is, that the infinitely many cases ''P''(0), ''P''(1), ''P''(2), ''P''(3), ...  all hold. Informal metaphors help ...
that any regular expression may be converted into an equivalent NFA.


Example

Two examples are now given, a small informal one with the result, and a bigger with a step by step application of the algorithm.


Small Example

The picture below shows the result of Thompson's construction on (ε, a*b). The purple oval corresponds to ''a'', the teal oval corresponds to ''a*'', the green oval corresponds to ''b'', the orange oval corresponds to ''a*b'', and the blue oval corresponds to ''ε''.


Application of the algorithm

As an example, the picture shows the result of Thompson's construction algorithm on the regular expression (0, (1(01*(00)*0)*1)*)* that denotes the set of binary numbers that are multiples of 3: : . The upper right part shows the logical structure (syntax tree) of the expression, with "." denoting concatenation (assumed to have variable arity); subexpressions are named - for reference purposes. The left part shows the nondeterministic finite automaton resulting from Thompson's algorithm, with the and state of each subexpression colored in and , respectively. An ε as transition label is omitted for clarity — unlabelled transitions are in fact ε transitions. The entry and exit state corresponding to the root expression is the start and accept state of the automaton, respectively. The algorithm's steps are as follows: An equivalent minimal deterministic automaton is shown below.


Relation to other algorithms

Thompson's is one of several algorithms for constructing NFAs from regular expressions; an earlier algorithm was given by McNaughton and Yamada. Converse to Thompson's construction,
Kleene's algorithm In theoretical computer science, in particular in formal language theory, Kleene's algorithm transforms a given nondeterministic finite automaton (NFA) into a regular expression. Together with other conversion algorithms, it establishes the equival ...
transforms a finite automaton into a regular expression. Glushkov's construction algorithm is similar to Thompson's construction, once the ε-transitions are removed.


References

{{Strings , state=collapsed Finite automata