Ukkonen's algorithm

Last updated March 27, 2024

In computer science, Ukkonen's algorithm is a linear-time, online algorithm for constructing suffix trees, proposed by Esko Ukkonen in 1995.^[1] The algorithm begins with an implicit suffix tree containing the first character of the string. It then steps through the string, adding successive characters until the tree is complete. This in-order addition of characters gives Ukkonen's algorithm its "on-line" property. The original algorithm, presented by Peter Weiner, proceeded backward from the last character to the first, from the shortest to the longest suffix.^[2] A simpler algorithm was found by Edward M. McCreight, going from the longest to the shortest suffix.^[3]

Implicit suffix tree

While a suffix tree is being generated with Ukkonen's algorithm, the intermediate steps produce implicit suffix trees, whose shape depends on the characters of the string S. An implicit suffix tree has no edge labelled with $ (or any other termination character) and no internal node with only one outgoing edge.
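One consequence of having no termination character is that a suffix which is also a prefix of another suffix does not end at a leaf of the implicit tree. The short Python sketch below illustrates this on plain strings, without building any tree; the function name hidden_suffixes is made up for this illustration.

```python
def hidden_suffixes(s):
    """Return the suffixes of s that are proper prefixes of another suffix.

    In an implicit suffix tree these suffixes end in the middle of a path
    rather than at a leaf.
    """
    suffixes = [s[i:] for i in range(len(s))]
    return [t for t in suffixes
            if any(u != t and u.startswith(t) for u in suffixes)]


print(hidden_suffixes("xabxa"))   # ['xa', 'a']
print(hidden_suffixes("xabxa$"))  # []
```

Appending a character that occurs nowhere else in the string, such as $, leaves no hidden suffixes, which is why a termination character turns the implicit tree into a true suffix tree.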

High level description of Ukkonen's algorithm

Ukkonen's algorithm constructs an implicit suffix tree T_i for each prefix S[1...i] of S (S being the string of length n). It first builds T₁ using the 1st character, then T₂ using the 2nd character, then T₃ using the 3rd character, ..., and finally T_n using the nth character. A suffix tree built with Ukkonen's algorithm has the following characteristics:

Implicit suffix tree T_{i+1} is built on top of implicit suffix tree T_i.
At any given time, Ukkonen's algorithm has built the suffix tree for the characters seen so far, which gives it its on-line property; the complete construction takes O(n) time.
Ukkonen's algorithm is divided into n phases (one phase for each character in the string with length n).
Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1...i+1].
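The phase and extension structure described above can be sketched as a pair of nested loops. This is only an enumeration of the work items, not the algorithm itself; the generator name extensions is an assumption made for this illustration, and indices are 1-based to match the text.

```python
def extensions(s):
    """Yield (phase, j, suffix) for every extension, using 1-based indices."""
    n = len(s)
    for phase in range(1, n + 1):        # one phase per prefix S[1...phase]
        for j in range(1, phase + 1):    # one extension per suffix S[j...phase]
            yield phase, j, s[j - 1:phase]


schedule = list(extensions("xab"))
for phase, j, suffix in schedule:
    print(f"phase {phase}, extension {j}: {suffix}")
```

For a string of length n the loops produce n(n+1)/2 work items in total, for example 6 for "xab".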

Suffix extension is all about adding the next character into the suffix tree built so far. In extension j of phase i+1, the algorithm finds the end of S[j...i] (which is already in the tree due to the previous phase i) and then extends S[j...i] to make sure the suffix S[j...i+1] is in the tree. There are three extension rules:

Rule 1: If the path from the root labelled S[j...i] ends at a leaf edge (i.e., S[i] is the last character on the leaf edge), then character S[i+1] is simply added to the end of the label on that leaf edge.
Rule 2: If the path from the root labelled S[j...i] ends at a non-leaf edge (i.e., there are more characters after S[i] on the path) and the next character is not S[i+1], then a new leaf edge labelled S[i+1], leading to a leaf numbered j, is created at the end of S[j...i]. A new internal node is also created if S[j...i] ends inside (in between) a non-leaf edge.
Rule 3: If the path from the root labelled S[j...i] ends at a non-leaf edge (i.e., there are more characters after S[i] on the path) and the next character is S[i+1] (already in the tree), do nothing.

One important point to note is that, from any given node (root or internal), at most one outgoing edge starts with a given character.
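The three rules can be tried out with a deliberately naive sketch. It is not Ukkonen's linear-time algorithm: it walks from the root in every extension and stores edge labels as plain strings, so it only illustrates the rules. The names Node, extend, build_implicit_tree and leaf_paths are made up for this illustration, and the code uses 0-based indices. Children are kept in a dictionary keyed by the first character of the edge label, which reflects the point that no node has two outgoing edges starting with the same character.

```python
class Node:
    def __init__(self):
        self.edges = {}  # first character of label -> [label, child Node]


def extend(root, beta, c):
    """Make sure beta + c is in the tree; return the number of the rule used.

    Assumes beta is already in the tree, as guaranteed by the previous phase.
    """
    parent, key, node, pos = None, None, root, 0
    while pos < len(beta):
        key = beta[pos]
        label, child = node.edges[key]
        rest = len(beta) - pos
        if rest < len(label):              # beta ends inside this edge
            if label[rest] == c:
                return 3                   # Rule 3: already present
            mid = Node()                   # Rule 2: split the edge
            mid.edges[label[rest]] = [label[rest:], child]
            mid.edges[c] = [c, Node()]
            node.edges[key] = [label[:rest], mid]
            return 2
        parent, node, pos = node, child, pos + len(label)
    if parent is not None and not node.edges:
        parent.edges[key][0] += c          # Rule 1: lengthen the leaf edge
        return 1
    if c in node.edges:
        return 3                           # Rule 3: an edge starts with c
    node.edges[c] = [c, Node()]            # Rule 2: new leaf, no split needed
    return 2


def build_implicit_tree(s):
    """Run all phases naively; return the root and the rules used per phase."""
    root, log = Node(), []
    for i in range(len(s)):                # this phase adds character s[i]
        log.append([extend(root, s[j:i], s[i]) for j in range(i + 1)])
    return root, log


def leaf_paths(node, prefix=""):
    """Collect the root-to-leaf path labels of the tree."""
    if not node.edges:
        return [prefix]
    paths = []
    for label, child in node.edges.values():
        paths.extend(leaf_paths(child, prefix + label))
    return paths


root, log = build_implicit_tree("xabxac")
print(log)  # [[2], [1, 2], [1, 1, 2], [1, 1, 1, 3], [1, 1, 1, 3, 3], [1, 1, 1, 2, 2, 2]]
print(sorted(leaf_paths(root)))
```

Because every extension restarts at the root, this sketch takes cubic time in the worst case; it is meant for checking small examples by hand, not for real use.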

Run time

The naive implementation for generating a suffix tree going forward requires $O(n^2)$ or even $O(n^3)$ time complexity in big O notation, where $n$ is the length of the string. By exploiting a number of algorithmic techniques, Ukkonen reduced this to $O(n)$ (linear) time for constant-size alphabets, and $O(n \log n)$ in general, matching the runtime performance of the earlier two algorithms.
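To see where the cubic bound of a naive forward construction can come from, the work can be counted directly. This is a back-of-the-envelope sketch, assuming each extension locates the end of S[j...i] by walking from the root one character at a time. Phase $i$ performs $i$ extensions, so the total number of extensions is

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} = O(n^2),$$

and since each walk costs up to $O(n)$ character comparisons, the total is $O(n^3)$. For $n = 6$ this gives $6 \cdot 7 / 2 = 21$ extensions.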

Ukkonen's algorithm example

To better illustrate how a suffix tree is constructed using Ukkonen's algorithm, we can consider the string S = xabxac.

Start with an empty root node.
Construct $T_{1}$ for S[1] by adding the first character of the string, x. Rule 2 applies, which creates a new leaf node.
Construct $T_{2}$ for S[1..2] by adding suffixes of xa (xa and a). Rule 1 applies to xa, extending the path label of the existing leaf edge. Rule 2 applies to a, creating a new leaf node.
Construct $T_{3}$ for S[1..3] by adding suffixes of xab (xab, ab and b). Rule 1 applies to xab and ab, extending the path labels of the existing leaf edges. Rule 2 applies to b, creating a new leaf node.
Construct $T_{4}$ for S[1..4] by adding suffixes of xabx (xabx, abx, bx and x). Rule 1 applies to xabx, abx and bx, extending the path labels of the existing leaf edges. Rule 3 applies to x, which is already in the tree, so nothing is done.
Construct $T_{5}$ for S[1..5] by adding suffixes of xabxa (xabxa, abxa, bxa, xa and a). Rule 1 applies to xabxa, abxa and bxa, extending the path labels of the existing leaf edges. Rule 3 applies to xa and a, which are already in the tree, so nothing is done.
Construct $T_{6}$ for S[1..6] by adding suffixes of xabxac (xabxac, abxac, bxac, xac, ac and c). Rule 1 applies to xabxac, abxac and bxac, extending the path labels of the existing leaf edges. Rule 2 applies to xac, ac and c, creating three new leaf edges and two new internal nodes.
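The counts in the last step can be cross-checked by brute force. The sketch below does not use Ukkonen's algorithm: it assumes a plain, uncompressed trie of all suffixes (nested dictionaries, one character per edge) and counts its leaves and its non-root branching nodes, which correspond to the leaves and internal nodes of the implicit suffix tree. The function name trie_shape is made up for this illustration.

```python
def trie_shape(text):
    """Return (leaves, branching non-root nodes) of the trie of all suffixes."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})   # one trie node per character
    leaves = branching = 0
    stack = list(root.values())              # skip the root itself
    while stack:
        node = stack.pop()
        if not node:
            leaves += 1                      # no children: a leaf
        elif len(node) >= 2:
            branching += 1                   # would be an internal node
        stack.extend(node.values())
    return leaves, branching


print(trie_shape("xabxa"))   # (3, 0)
print(trie_shape("xabxac"))  # (6, 2)
```

Going from xabxa to xabxac adds three leaves and two branching nodes, in line with the final step of the walk-through.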

References

[1] Ukkonen, E. (1995). "On-line construction of suffix trees" (PDF). Algorithmica. 14 (3): 249–260. CiteSeerX 10.1.1.10.751. doi:10.1007/BF01206331. S2CID 6027556.
[2] Weiner, Peter (1973). "Linear pattern matching algorithms" (PDF). 14th Annual Symposium on Switching and Automata Theory (SWAT 1973). pp. 1–11. CiteSeerX 10.1.1.474.9582. doi:10.1109/SWAT.1973.13.
[3] McCreight, Edward Meyers (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM. 23 (2): 262–272. CiteSeerX 10.1.1.130.8022. doi:10.1145/321941.321946. S2CID 9250303.

External links

Detailed explanation in plain English
Fast String Searching With Suffix Trees, Mark Nelson's tutorial. Includes an implementation example written in C++.
Implementation in C with detailed explanation
Lecture slides by Guy Blelloch
Ukkonen's homepage
Text-Indexing project (Ukkonen's linear-time construction of suffix trees)
Implementation in C Part 1 Part 2 Part 3 Part 4 Part 5 Part 6

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
