Suffix tree

Last updated
Suffix tree for the text BANANA. Each substring is terminated with special character $. The six paths from the root to the leaves (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links, drawn dashed, are used during construction. Suffix tree BANANA.svg
Suffix tree for the text BANANA. Each substring is terminated with special character $. The six paths from the root to the leaves (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links, drawn dashed, are used during construction.

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

Contents

The construction of such a tree for the string takes time and space linear in the length of . Once constructed, several operations can be performed quickly, such as locating a substring in , locating a substring if a certain number of mistakes are allowed, and locating matches for a regular expression pattern. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. [2] These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.

History

The concept was first introduced by Weiner (1973). Rather than the suffix , Weiner stored in his trie [3] the prefix identifier for each position, that is, the shortest string starting at and occurring only once in . His Algorithm D takes an uncompressed [4] trie for and extends it into a trie for . This way, starting from the trivial trie for , a trie for can be built by successive calls to Algorithm D; however, the overall run time is . Weiner's Algorithm B maintains several auxiliary data structures, to achieve an over all run time linear in the size of the constructed trie. The latter can still be nodes, e.g. for Weiner's Algorithm C finally uses compressed tries to achieve linear overall storage size and run time. [5] Donald Knuth subsequently characterized the latter as "Algorithm of the Year 1973" according to his student Vaughan Pratt.[ original research? ] [6] The text book Aho, Hopcroft & Ullman (1974 , Sect.9.5) reproduced Weiner's results in a simplified and more elegant form, introducing the term position tree.

McCreight (1976) was the first to build a (compressed) trie of all suffixes of . Although the suffix starting at is usually longer than the prefix identifier, their path representations in a compressed trie do not differ in size. On the other hand, McCreight could dispense with most of Weiner's auxiliary data structures; only suffix links remained.

Ukkonen (1995) further simplified the construction. [6] He provided the first online-construction of suffix trees, now known as Ukkonen's algorithm, with running time that matched the then fastest algorithms. These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of in general.

Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach's algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc.

Definition

The suffix tree for the string of length is defined as a tree such that: [7]

  1. The tree has exactly n leaves numbered from to .
  2. Except for the root, every internal node has at least two children.
  3. Each edge is labelled with a non-empty substring of .
  4. No two edges starting out of a node can have string-labels beginning with the same character.
  5. The string obtained by concatenating all the string-labels found on the path from the root to leaf spells out suffix , for from to .

If the a suffix of is also the prefix of another suffix, such a tree does not exist for the string. For example, in the string abcbc, the suffix bc is also a prefix of the suffix bcbc. In such a case, the path spelling out bc will not end in a leaf, violating the fifth rule. To fix this problem, is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be leaf nodes, one for each of the suffixes of . [8] Since all internal non-root nodes are branching, there can be at most such nodes, and nodes in total ( leaves, internal non-root nodes, 1 root).

Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string , where is a single character and is a string (possibly empty), it has a suffix link to the internal node representing . See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

A generalized suffix tree is a suffix tree made for a set of strings instead of a single string. It represents all suffixes from this set of strings. Each string must be terminated by a different termination symbol.

Functionality

A suffix tree for a string of length can be built in time, if the letters come from an alphabet of integers in a polynomial range (in particular, this is true for constant-sized alphabets). [9] For larger alphabets, the running time is dominated by first sorting the letters to bring them into a range of size ; in general, this takes time. The costs below are given under the assumption that the alphabet is constant.

Assume that a suffix tree has been built for the string of length , or that a generalised suffix tree has been built for the set of strings of total length . You can:

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in time. [17] One can then also:

Applications

Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology and other application areas. [25] Primary applications include: [25]

Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines. [26]

Implementation

If each node and edge can be represented in space, the entire tree can be represented in space. The total length of all the strings on all of the edges in the tree is , but each edge can be stored as the position and length of a substring of S, giving a total space usage of computer words. The worst-case space usage of a suffix tree is seen with a fibonacci word, giving the full nodes.

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running time properties use hash maps, sorted or unsorted arrays (with array doubling), or balanced search trees. We are interested in:

Let σ be the size of the alphabet. Then you have the following costs:

LookupInsertionTraversal
Sibling lists / unsorted arraysO(σ)Θ(1)Θ(1)
Bitwise sibling treesO(log σ)Θ(1)Θ(1)
Hash mapsΘ(1)Θ(1)O(σ)
Balanced search treeO(log σ)O(log σ)O(1)
Sorted arraysO(log σ)O(σ)O(1)
Hash maps + sibling listsO(1)O(1)O(1)

The insertion cost is amortised, and that the costs for hashing are given for perfect hashing.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of 8 (for array including LCP values built within 32-bit address space and 8-bit characters.) This factor depends on the properties and may reach 2 with usage of 4-byte wide characters (needed to contain any symbol in some UNIX-like systems, see wchar_t ) on 32-bit systems. Researchers have continued to find smaller indexing structures.

Parallel construction

Various parallel algorithms to speed up suffix tree construction have been proposed. [27] [28] [29] [30] [31] Recently, a practical parallel algorithm for suffix tree construction with work (sequential time) and span has been developed. The algorithm achieves good parallel scalability on shared-memory multicore machines and can index the human genome – approximately 3GB – in under 3 minutes using a 40-core machine. [32]

External construction

Though linear, the memory usage of a suffix tree is significantly higher than the actual size of the sequence collection. For a large text, construction may require external memory approaches.

There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach-Colton, Ferragina & Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting. However the overall intricacy of this algorithm has prevented, so far, its practical implementation. [33]

On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (few) GB/hours. The state of the art methods are TDD, [34] TRELLIS, [35] DiGeST, [36] and B2ST. [37]

TDD and TRELLIS scale up to the entire human genome resulting in a disk-based suffix tree of a size in the tens of gigabytes. [34] [35] However, these methods cannot handle efficiently collections of sequences exceeding 3GB. [36] DiGeST performs significantly better and is able to handle collections of sequences in the order of 6GB in about 6 hours. [36]

All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B2ST, [37] scales to handle inputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16GB RAM. On a simple Linux cluster with 16 nodes (4GB RAM per node), ERA can index the entire human genome in less than 9 minutes. [38]

See also

Notes

  1. Donald E. Knuth; James H. Morris; Vaughan R. Pratt (Jun 1977). "Fast Pattern Matching in Strings" (PDF). SIAM Journal on Computing. 6 (2): 323–350. doi:10.1137/0206024. Here: p.339 bottom.
  2. Knuth conjectured in 1970 that the problem could not be solved in linear time. [1] In 1973, this was refuted by Weiner's suffix-tree algorithm Weiner (1973).
  3. This term is used here to distinguish Weiner's precursor data structures from proper suffix trees as defined above and unconsidered before McCreight (1976).
  4. i.e., with each branch labelled by a single character
  5. See File:WeinerB aaaabbbbaaaabbbb.gif and File:WeinerC aaaabbbbaaaabbbb.gif for an uncompressed example tree and its compressed correspondant.
  6. 1 2 Giegerich & Kurtz (1997).
  7. Gusfield (1999) , p.90.
  8. Gusfield (1999) , p.90-91.
  9. Farach (1997).
  10. Gusfield (1999) , p.92.
  11. Gusfield (1999) , p.123.
  12. Baeza-Yates & Gonnet (1996).
  13. Gusfield (1999) , p.132.
  14. Gusfield (1999) , p.125.
  15. Gusfield (1999) , p.144.
  16. Gusfield (1999) , p.166.
  17. Gusfield (1999) , Chapter 8.
  18. Gusfield (1999) , p.196.
  19. Gusfield (1999) , p.200.
  20. Gusfield (1999) , p.198.
  21. Gusfield (1999) , p.201.
  22. Gusfield (1999) , p.204.
  23. Gusfield (1999) , p.205.
  24. Gusfield (1999) , pp.197–199.
  25. 1 2 Allison, L. "Suffix Trees". Archived from the original on 2008-10-13. Retrieved 2008-10-14.
  26. First introduced by Zamir & Etzioni (1998).
  27. Apostolico et al. (1988).
  28. Hariharan (1994).
  29. Sahinalp & Vishkin (1994).
  30. Farach & Muthukrishnan (1996).
  31. Iliopoulos & Rytter (2004).
  32. Shun & Blelloch (2014).
  33. Smyth (2003).
  34. 1 2 Tata, Hankins & Patel (2003).
  35. 1 2 Phoophakdee & Zaki (2007).
  36. 1 2 3 Barsky et al. (2008).
  37. 1 2 Barsky et al. (2009).
  38. Mansour et al. (2011).

Related Research Articles

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

<span class="mw-page-title-main">Aho–Corasick algorithm</span> String-searching algorithm

In computer science, the Aho—Corasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick in 1975. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings within an input text. It matches all strings simultaneously. The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

<span class="mw-page-title-main">Substring</span> Contiguous part of a sequence of symbols

In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "the best of" is a substring of "It was the best of times". In contrast, "Itwastimes" is a subsequence of "It was the best of times", but not a substring.

<span class="mw-page-title-main">Generalized suffix tree</span>

In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings of total length , it is a Patricia tree containing all suffixes of the strings. It is mostly used in bioinformatics.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

<span class="mw-page-title-main">Longest repeated substring problem</span>

In computer science, the longest repeated substring problem is the problem of finding the longest substring of a string that occurs at least twice.

In computational phylogenetics, tree alignment is a computational problem concerned with producing multiple sequence alignments, or alignments of three or more sequences of DNA, RNA, or protein. Sequences are arranged into a phylogenetic tree, modeling the evolutionary relationships between species or taxa. The edit distances between sequences are calculated for each of the tree's internal vertices, such that the sum of all edit distances within the tree is minimized. Tree alignment can be accomplished using one of several algorithms with various trade-offs between manageable tree size and computational effort.

In computer science, a maximal pair within a string is a pair of matching substrings that are maximal, where "maximal" means that it is not possible to make a longer matching pair by extending the range of both substrings to the left or right.

<span class="mw-page-title-main">Cartesian tree</span> Binary tree derived from a sequence of numbers

In computer science, a Cartesian tree is a binary tree derived from a sequence of distinct numbers. To construct the Cartesian tree, set its root to be the minimum number in the sequence, and recursively construct its left and right subtrees from the subsequences before and after this number. It is uniquely defined as a min-heap whose symmetric (in-order) traversal returns the original sequence.

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings rather than returning only one substring or returning the maximum length of a palindromic substring.

<span class="mw-page-title-main">Wavelet Tree</span>

The Wavelet Tree is a succinct data structure to store strings in compressed space. It generalizes the and operations defined on bitvectors to arbitrary alphabets.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

<span class="mw-page-title-main">Suffix automaton</span> Deterministic finite automaton accepting set of all suffixes of particular string

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

In computer science, a generalized suffix array (GSA) is a suffix array containing all suffixes for a set of strings. Given the set of strings of total length , it is a lexicographically sorted array of all suffixes of each string in . It is primarily used in bioinformatics and string processing.

In computer science a palindrome tree, also called an EerTree, is a type of search tree, that allows for fast access to all palindromes contained in a string. They can be used to solve the longest palindromic substring, the k-factorization problem, palindromic length of a string, and finding and counting all distinct sub-palindromes. Palindrome trees do this in an online manner, that is it does not require the entire string at the start and can be added to character by character.

References