Ukkonen's algorithm
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, Ukkonen's algorithm is a linear-time,
online algorithm In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start. In contrast, an o ...
for constructing
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
s, proposed by
Esko Ukkonen Esko Juhani Ukkonen (b. 1950) is a Finnish theoretical computer scientist known for his contributions to string algorithms, and particularly for Ukkonen's algorithm for suffix tree construction. He is a professor emeritus of the University of He ...
in 1995. The algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string, adding successive characters until the tree is complete. This order addition of characters gives Ukkonen's algorithm its "on-line" property. The original algorithm presented by
Peter Weiner Peter may refer to: People * List of people named Peter, a list of people and fictional characters with the given name * Peter (given name) ** Saint Peter (died 60s), apostle of Jesus, leader of the early Christian Church * Peter (surname), a su ...
proceeded backward from the last character to the first one from the shortest to the longest suffix. A simpler algorithm was found by
Edward M. McCreight Edward Meyers McCreight is an American computer scientist. He received his Ph.D. in computer science from Carnegie Mellon University in 1969, advised by Albert R. Meyer. He co-invented the B-tree with Rudolf Bayer while at Boeing, and improved ...
, going from the longest to the shortest suffix.


Implicit suffix tree

While generating suffix tree using Ukkonen's algorithm, we will see implicit suffix tree in intermediate steps depending on characters in string S. In implicit suffix trees, there will be no edge with $ (or any other termination character) label and no internal node with only one edge going out of it.


High level description of Ukkonen's algorithm

Ukkonen's algorithm constructs an implicit suffix tree T for each prefix S ...iof S (S being the string of length n). It first builds T using 1 character, then T using 2 character, then T using 3 character, ..., T using the n character. You can find the following characteristics in a suffix tree that uses Ukkonen's algorithm: * Implicit suffix tree T is built on top of implicit suffix tree T . * At any given time, Ukkonen's algorithm builds the suffix tree for the characters seen so far and so it has on-line property, allowing the algorithm to have an execution time of ''O(n).'' * Ukkonen's algorithm is divided into n phases (one phase for each character in the string with length n). * Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S ...i+1 Suffix extension is all about adding the next character into the suffix tree built so far. In extension j of phase i+1, algorithm finds the end of S ...i(which is already in the tree due to previous phase i) and then it extends S ...ito be sure the suffix S ...i+1is in the tree. There are three extension rules: # If the path from the root labelled S ...iends at a leaf edge (i.e., S is last character on leaf edge), then character S +1is just added to the end of the label on that leaf edge. # if the path from the root labelled S ...iends at a non-leaf edge (i.e., there are more characters after S on path) and next character is not S +1 then a new leaf edge with label S +1and number j is created starting from character S +1 A new internal node will also be created if S ...iends inside (in between) a non-leaf edge. # If the path from the root labelled S ..iends at a non-leaf edge (i.e., there are more characters after S on path) and next character is S +1(already in tree), do nothing. One important point to note is that from a given node (root or internal), there will be one and only one edge starting from one character. There will not be more than one edge going out of any node starting with the same character.


Run time

The naive implementation for generating a suffix tree going forward requires or even time complexity in
big O notation Big ''O'' notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. Big O is a member of a family of notations invented by Paul Bachmann, Edmund Lan ...
, where is the length of the string. By exploiting a number of algorithmic techniques, Ukkonen reduced this to (linear) time, for constant-size alphabets, and in general, matching the runtime performance of the earlier two algorithms.


Ukkonen's algorithm example

To better illustrate how a suffix tree using Ukkonen's algorithm is constructed, we can use the following example: S=xabxac # Start with an empty root node. # Construct T for S by adding the first character of the string. Rule 2 applies, which creates a new leaf node. # Construct T for S ...2by adding suffixes of xa (xa and a). Rule 1 applies, which extends the path label in existing leaf edge. Rule 2 applies, which creates a new leaf node. # Construct T for S ...3by adding suffixes of xab (xab, ab and b). Rule 1 applies, which extends the path label in existing leaf edge. Rule 2 applies, which creates a new leaf node. # Construct T for S ...4by adding suffixes of xabx (xabx, abx, bx and x). Rule 1 applies, which extends the path label in existing leaf edge. Rule 3 applies, do nothing. # Constructs T for S ...5by adding suffixes of xabxa (xabxa, abxa, bxa, xa and x). Rule 1 applies, which extends the path label in existing leaf edge. Rule 3 applies, do nothing. # Constructs T for S ...6by adding suffixes of xabxac (xabxac, abxac, bxac, xac, ac and c). Rule 1 applies, which extends the path label in existing leaf edge. Rule 2 applies, which creates a new leaf node (in this case, three new lead edges and two new internal nodes are created).


References


External links


Detailed explanation in plain English

Fast String Searching With Suffix Trees
Mark Nelson's tutorial. Has an implementation example written with C++.


Lecture slides by Guy Blelloch

Ukkonen's homepage

Text-Indexing project
(Ukkonen's linear-time construction of suffix trees) * Implementation in
Part 1Part 2Part 3Part 4Part 5Part 6
Bioinformatics algorithms Algorithms on strings Substring indices {{algorithm-stub