In computer science, Ukkonen's algorithm is a linear-time, online algorithm for constructing suffix trees, proposed by Esko Ukkonen in 1995. [1] The algorithm begins with an implicit suffix tree containing the first character of the string. It then steps through the string, adding successive characters until the tree is complete. This in-order addition of characters gives Ukkonen's algorithm its "on-line" property. The original algorithm, presented by Peter Weiner, proceeded backward from the last character to the first, from the shortest to the longest suffix. [2] A simpler algorithm was found by Edward M. McCreight, going from the longest to the shortest suffix. [3]
While a suffix tree is being generated with Ukkonen's algorithm, the intermediate steps produce implicit suffix trees, whose shape depends on the characters of the string S. An implicit suffix tree has no edge labelled with $ (or any other termination character) and no internal node with only one outgoing edge.
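As a small illustration of what "implicit" means here, the sketch below lists which suffixes of a string end at a leaf of its implicit suffix tree. It relies on the observation that a suffix ends inside the tree, rather than at a leaf, exactly when it is a prefix of a longer suffix. The function name `leaf_suffixes` is hypothetical and chosen only for this example.

```python
def leaf_suffixes(text):
    """Suffixes of `text` that end at a leaf of its implicit suffix tree."""
    # A suffix starting at j is a prefix of a longer suffix exactly when it
    # also occurs earlier in the text, i.e. when its first occurrence is not j.
    return [text[j:] for j in range(len(text)) if text.find(text[j:]) == j]


print(leaf_suffixes("xabxa"))        # ['xabxa', 'abxa', 'bxa']
print(len(leaf_suffixes("xabxa$")))  # 6
```

Without a termination character, the suffixes xa and a of xabxa are hidden inside longer paths. Appending $ makes every suffix unique, so each one gets its own leaf.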
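Ukkonen's algorithm constructs an implicit suffix tree T_i for each prefix S[1...i] of S (S being the string of length n). It first builds T_1 using the 1st character, then T_2 using the 2nd character, then T_3 using the 3rd character, ..., T_n using the nth character. The construction has the following characteristics: each tree T_{i+1} is built on top of T_i; the algorithm is divided into n phases, one for each character of S; and each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1...i+1].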
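The phase and extension structure can be sketched as a pair of nested loops. This is only an enumeration of the work items, not the algorithm itself. The name `phases` is hypothetical, and Python's 0-based slices stand in for the 1-based notation S[j...i].

```python
def phases(s):
    """Yield (phase number, suffixes handled by that phase's extensions)."""
    for i in range(1, len(s) + 1):
        # Phase i handles every suffix of the prefix S[1...i], longest first.
        yield i, [s[j:i] for j in range(i)]


for number, suffixes in phases("xab"):
    print(number, suffixes)
# 1 ['x']
# 2 ['xa', 'a']
# 3 ['xab', 'ab', 'b']
```

Over all n phases this gives n(n+1)/2 extensions, which is why a version that handles each extension explicitly cannot run in linear time.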
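Suffix extension adds the next character to the suffix tree built so far. In extension j of phase i+1, the algorithm finds the end of S[j...i] (which is already in the tree from the previous phase i) and then extends it to ensure that the suffix S[j...i+1] is in the tree. There are three extension rules:
Rule 1: if the path labelled S[j...i] ends at the end of a leaf edge, the character S[i+1] is appended to the label of that leaf edge.
Rule 2: if the path labelled S[j...i] does not end at a leaf and no continuation of the path starts with S[i+1], a new leaf edge labelled S[i+1] is created. If the path ends in the middle of an edge, that edge is first split by a new internal node.
Rule 3: if some continuation of the path labelled S[j...i] already starts with S[i+1], then S[j...i+1] is already in the tree and nothing is done.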
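The extension step can be sketched naively by walking from the root for every extension. This is the slow, cubic-time formulation and not Ukkonen's linear-time algorithm. The names `Node`, `Edge`, `extend` and `build_naive` are illustrative assumptions, indices are 0-based, and the rule numbers follow the usual textbook convention (1: extend a leaf edge, 2: add a new leaf, 3: do nothing).

```python
class Node:
    def __init__(self):
        self.edges = {}  # first character of the edge label -> Edge


class Edge:
    def __init__(self, start, end, child):
        # The edge label is s[start:end]; `child` is the node it leads to.
        self.start, self.end, self.child = start, end, child


def extend(root, s, j, i):
    """Make sure s[j:i+1] is in the tree, given that s[j:i] already is.

    Returns the number of the extension rule that was applied.
    """
    node, pos = root, j
    while pos < i:
        edge = node.edges[s[pos]]
        length = edge.end - edge.start
        remaining = i - pos
        if remaining < length:
            # The path s[j:i] ends in the middle of this edge.
            k = edge.start + remaining
            if s[k] == s[i]:
                return 3  # Rule 3: the next character is already there
            # Rule 2 with a split: new internal node plus a new leaf.
            mid = Node()
            mid.edges[s[k]] = Edge(k, edge.end, edge.child)
            mid.edges[s[i]] = Edge(i, i + 1, Node())
            edge.end, edge.child = k, mid
            return 2
        pos += length
        if pos == i and not edge.child.edges:
            edge.end += 1  # Rule 1: the path ends at a leaf, extend its label
            return 1
        node = edge.child
    # The path s[j:i] ends exactly at a node (the root when j == i).
    if s[i] in node.edges:
        return 3
    node.edges[s[i]] = Edge(i, i + 1, Node())  # Rule 2 without a split
    return 2


def build_naive(s):
    """Build the implicit suffix tree of s; also return the rules per phase."""
    root, trace = Node(), []
    for i in range(len(s)):
        trace.append([extend(root, s, j, i) for j in range(i + 1)])
    return root, trace


print(build_naive("xabxac")[1])
# [[2], [1, 2], [1, 1, 2], [1, 1, 1, 3], [1, 1, 1, 3, 3], [1, 1, 1, 2, 2, 2]]
```

Each inner list holds the rule applied by extensions 1, 2, ... of one phase for the sample string xabxac. Within a phase the rules always appear in the order 1, then 2, then 3, which is one of the regularities the linear-time algorithm exploits.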
One important point to note is that, from a given node (root or internal), at most one outgoing edge starts with any given character. No node ever has two outgoing edges whose labels begin with the same character.
A naive implementation that generates a suffix tree going forward requires O(n^2) or even O(n^3) time in big O notation, where n is the length of the string. By exploiting a number of algorithmic techniques, Ukkonen reduced this to O(n) (linear) time for constant-size alphabets, and O(n log n) in general, matching the runtime performance of the earlier two algorithms.
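For comparison with the naive approach, here is a compact sketch of the linear-time construction in the widely used "active point" formulation. It combines suffix links, open-ended leaf edges and skipping whole edges while walking down. The class and function names, the `pending` and `remainder` variables and the `contains` helper are assumptions of this sketch and are not taken from Ukkonen's paper. The input is assumed to end with a unique terminator such as $ so that every suffix ends at a leaf.

```python
class Node:
    def __init__(self, start, end=None):
        self.start = start   # the edge label into this node is s[start:end]
        self.end = end       # None marks a leaf: the label runs to the current end
        self.children = {}   # first character of the edge label -> child Node
        self.link = None     # suffix link (only set on internal nodes)


def build_suffix_tree(s):
    root = Node(-1, -1)
    node, edge, length = root, 0, 0  # active point: node, edge start index, offset
    remainder = 0                    # suffixes still waiting to be inserted
    for i, c in enumerate(s):
        pending = None               # internal node still waiting for its suffix link
        remainder += 1
        while remainder > 0:
            if length == 0:
                edge = i
            child = node.children.get(s[edge])
            if child is None:
                node.children[s[edge]] = Node(i)       # new leaf at a node
                if pending is not None:
                    pending.link, pending = node, None
            else:
                end = i + 1 if child.end is None else child.end
                span = end - child.start
                if length >= span:                     # skip over the whole edge
                    edge, length, node = edge + span, length - span, child
                    continue
                if s[child.start + length] == c:       # already present: end phase
                    length += 1
                    if pending is not None:
                        pending.link = node
                    break
                split = Node(child.start, child.start + length)
                node.children[s[edge]] = split         # split the edge
                split.children[c] = Node(i)            # new leaf below the split
                child.start += length
                split.children[s[child.start]] = child
                if pending is not None:
                    pending.link = split
                pending = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link if node.link is not None else root
    return root


def contains(root, s, pattern):
    """Return True if `pattern` is a substring of s, by walking the tree."""
    node, k = root, 0
    while k < len(pattern):
        child = node.children.get(pattern[k])
        if child is None:
            return False
        label = s[child.start:len(s) if child.end is None else child.end]
        part = pattern[k:k + len(label)]
        if label[:len(part)] != part:
            return False
        k, node = k + len(part), child
    return True


text = "xabxac$"
tree = build_suffix_tree(text)
print(contains(tree, text, "bxa"), contains(tree, text, "xc"))  # True False
```

The children are stored in a dictionary keyed by the first character of each edge label, which gives expected constant-time branching. A sorted or tree-based child map would correspond to the O(n log n) bound for general alphabets.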
To better illustrate how a suffix tree is constructed using Ukkonen's algorithm, we can consider the string S = xabxac. The implicit suffix tree for each prefix of S is built in turn:
S[1]: add the first character of the string. Rule 2 applies, which creates a new leaf node.
S[1..2]: add the suffixes of xa (xa and a). Rule 1 applies, which extends the path label on the existing leaf edge. Rule 2 applies, which creates a new leaf node.
S[1..3]: add the suffixes of xab (xab, ab and b). Rule 1 applies, which extends the path labels on the existing leaf edges. Rule 2 applies, which creates a new leaf node.
S[1..4]: add the suffixes of xabx (xabx, abx, bx and x). Rule 1 applies, which extends the path labels on the existing leaf edges. Rule 3 applies, so nothing is done.
S[1..5]: add the suffixes of xabxa (xabxa, abxa, bxa, xa and a). Rule 1 applies, which extends the path labels on the existing leaf edges. Rule 3 applies, so nothing is done.
S[1..6]: add the suffixes of xabxac (xabxac, abxac, bxac, xac, ac and c). Rule 1 applies, which extends the path labels on the existing leaf edges. Rule 2 applies, which creates new leaf nodes (in this case, three new leaf edges and two new internal nodes are created).