Suffix array

Last updated
Suffix array
Type Array
Invented by Manber & Myers (1990)
Time complexity
in big O notation
Average Worst case
Space
Construction

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

Contents

Suffix arrays were introduced by Manber & Myers (1990) as a simple, space efficient alternative to suffix trees. They had independently been discovered by Gaston Gonnet in 1987 under the name PAT array( Gonnet, Baeza-Yates & Snider 1992 ).

Li, Li & Huo (2016) gave the first in-place time suffix array construction algorithm that is optimal both in time and space, where in-place means that the algorithm only needs additional space beyond the input string and the output suffix array.

Enhanced suffix arrays (ESAs) are suffix arrays with additional tables that reproduce the full functionality of suffix trees preserving the same time and memory complexity. [1] The suffix array for a subset of all suffixes of a string is called sparse suffix array. [2] Multiple probabilistic algorithms have been developed to minimize the additional memory usage including an optimal time and memory algorithm. [3]

Definition

Let be an -string and let denote the substring of ranging from to inclusive.

The suffix array of is now defined to be an array of integers providing the starting positions of suffixes of in lexicographical order. This means, an entry contains the starting position of the -th smallest suffix in and thus for all : .

Each suffix of shows up in exactly once. Suffixes are simple strings. These strings are sorted (as in a paper dictionary), before their starting positions (integer indices) are saved in .

Example

Consider the text =banana$ to be indexed:

i1234567
banana$

The text ends with the special sentinel letter $ that is unique and lexicographically smaller than any other character. The text has the following suffixes:

Suffixi
banana$1
anana$2
nana$3
ana$4
na$5
a$6
$7

These suffixes can be sorted in ascending order:

Suffixi
$7
a$6
ana$4
anana$2
banana$1
na$5
nana$3

The suffix array contains the starting positions of these sorted suffixes:

i =1234567
=7642153

The suffix array with the suffixes written out vertically underneath for clarity:

i =1234567
=7642153
1$aaabnn
2$nnaaa
3aan$n
4$naa
5an$
6$a
7$

So for example, contains the value 4, and therefore refers to the suffix starting at position 4 within , which is the suffix ana$.

Correspondence to suffix trees

Suffix arrays are closely related to suffix trees:

It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suffix array enhanced with additional information (such as the LCP array) and solves the same problem in the same time complexity. [1] Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms (e.g., compared to Ukkonen's algorithm) and improved cache locality. [4]

Space efficiency

Suffix arrays were introduced by Manber & Myers (1990) in order to improve over the space requirements of suffix trees: Suffix arrays store integers. Assuming an integer requires bytes, a suffix array requires bytes in total. This is significantly less than the bytes which are required by a careful suffix tree implementation. [5]

However, in certain applications, the space requirements of suffix arrays may still be prohibitive. Analyzed in bits, a suffix array requires space, whereas the original text over an alphabet of size only requires bits. For a human genome with and the suffix array would therefore occupy about 16 times more memory than the genome itself.

Such discrepancies motivated a trend towards compressed suffix arrays and BWT-based compressed full-text indices such as the FM-index. These data structures require only space within the size of the text or even less.

Construction algorithms

A suffix tree can be built in and can be converted into a suffix array by traversing the tree depth-first also in , so there exist algorithms that can build a suffix array in .

A naive approach to construct a suffix array is to use a comparison-based sorting algorithm. These algorithms require suffix comparisons, but a suffix comparison runs in time, so the overall runtime of this approach is .

More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other. These algorithms strive to achieve the following goals: [6]

One of the first algorithms to achieve all goals is the SA-IS algorithm of Nong, Zhang & Chan (2009). The algorithm is also rather simple (< 100 LOC) and can be enhanced to simultaneously construct the LCP array. [7] The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful implementation by Yuta Mori [8] outperforms most other linear or super-linear construction approaches.

Beside time and space requirements, suffix array construction algorithms are also differentiated by their supported alphabet: constant alphabets where the alphabet size is bound by a constant, integer alphabets where characters are integers in a range depending on and general alphabets where only character comparisons are allowed. [9]

Most suffix array construction algorithms are based on one of the following approaches: [6]

A well-known recursive algorithm for integer alphabets is the DC3 / skew algorithm of Kärkkäinen & Sanders (2003). It runs in linear time and has successfully been used as the basis for parallel [10] and external memory [11] suffix array construction algorithms.

Recent work by Salson et al. (2010) proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is , it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text.

In practical open source work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm. [12] This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses a induced copying combined with Itoh-Tanaka. [13] In 2021 a faster implementation of the algorithm was presented by Ilya Grebnov [14] which in average showed 65% performance improvement over DivSufSort implementation on Silesia Corpus. [15]

Generalized suffix array

The concept of a suffix array can be extended to more than one string. This is called a generalized suffix array (or GSA), a suffix array that contains all suffixes for a set of strings (for example, and is lexicographically sorted with all suffixes of each string. [16]

Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring pattern within the string . Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with two binary searches. The first search locates the starting position of the interval, and the second one determines the end position: [ citation needed ]

n=len(S)defsearch(P:str)->Tuple[int,int]:"""    Return indices (s, r) such that the interval A[s:r] (including the end    index) represents all suffixes of S that start with the pattern P.    """# Find starting position of intervall=0# in Python, arrays are indexed starting at 0r=nwhilel<r:mid=(l+r)//2# division rounding down to nearest integer# suffixAt(A[i]) is the ith smallest suffixifP>suffixAt(A[mid]):l=mid+1else:r=mids=l# Find ending position of intervalr=nwhilel<r:mid=(l+r)//2ifsuffixAt(A[mid]).startswith(P):l=mid+1else:r=midreturn(s,r)

Finding the substring pattern of length in the string of length takes time, given that a single suffix comparison needs to compare characters. Manber & Myers (1990) describe how this bound can be improved to time using LCP information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval. Abouelhoda, Kurtz & Ohlebusch (2004) improve the bound even further and achieve a search time of for constant alphabet size, as known from suffix trees.

Suffix sorting algorithms can be used to compute the Burrows–Wheeler transform (BWT). The BWT requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotated BWT matrix corresponds to the order of suffixes in a suffix array. The BWT can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the BWT string: .

Suffix arrays can also be used to look up substrings in example-based machine translation, demanding much less storage than a full phrase table as used in Statistical machine translation.

Many additional applications of the suffix array require the LCP array. Some of these are detailed in the application section of the latter.

Enhanced suffix arrays

Suffix trees are powerful data structures that have wide application in areas of pattern and string matching, indexing and textual statistics. However, it occupies a significant amount of space and thus has a drawback in many real-time applications that require processing a considerably large amount of data like genome analysis. To overcome this drawback, Enhanced Suffix Arrays were developed that are data structures consisting of suffix arrays and an additional table called the child table that contains the information about the parent-child relationship between the nodes in the suffix tree. The node branching data structure for this tree is a linked list. Enhanced suffix arrays are superior in terms of both space efficiency and time complexity and are easy to implement. Moreover, they can be applied to any algorithm that uses a suffix tree by using an abstract concept lcp-interval trees. The time complexity for searching a pattern in an enhanced suffix array is O(m|Σ|).

The suffix array of the string is an array of n integers in the range of 0 to n that represents the n+1 suffixes of the string including the special character #.

The suffix array is composed of two arrays:

  1. pos array pos[1,...n]: It represents a sorted list of all S suffixes. Only the initial positions of the suffixes are stored in the array to reduce the space complexity since the suffixes are too large.
  2. lcp array lcp[1,...n]: It is an array of n integers that maintains the lengths of the longest common prefix of two consecutive suffixes stored in the pos array.

Constructing the lcp-interval

For a suffix array of S, the lcp-interval associated with the corresponding node of suffix tree of S can be defined as:

Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if

1. lcptab[i] < l,

2. lcptab[k] ≥ l for all i + 1 ≤ k ≤ j,

3. lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,

4. lcptab[j + 1] < l.

The length of the longest common prefix of pos[i − 1] and pos[i] is stored in lcp[i],where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of [i..j] is a child of the corresponding node of [k..l], a lcp-interval [i..j] is a child interval of another lcp-interval [k..l]. If [k..l] is a child interval of [i..j], a lcp-interval [i..j] is the parent interval of a lcp-interval [k..l].

Constructing a child table

The child table cldtab is composed of three n arrays, up, down and nextlIndex. The information about the edges of the corresponding suffix tree is stored and maintained by the up and down arrays. The nextlIndex array stores the links in the linked list used for node branching the suffix tree.

The up, down and nextlIndex array are defined as follows:

  1. The element up[i] records the starting index of the longest lcp-second interval’s child interval, which ends at index i-1.
  2. The initial index of the second child interval of the longest lcp-interval, starting at index i is stored in the element down[i].
  3. If and only if the interval is neither the first child nor the final child of its parent, the element nextlIndex[i] contains the first index of the next sibling interval of the longest lcp-interval, starting at index i.

By performing a bottom-up traversal of the lcp-interval of the tree, the child table can be constructed in linear time. The up/down values and the nextlIndex values can be computed separately by using two distinct algorithms.

The suffix links for an enhanced suffix array can be computed by generating the suffix link interval [1,..,r] for each [i,..j] interval during the preprocessing. The left and right elements l and r of the interval are maintained in the first index of [i,..,j]. The table for this interval ranges from 0 to n. The suffix link table is constructed by the left-to-right breadth-first traversal of the lcp-interval tree. Every time an l-interval is computed, it is added to the list of l-intervals, which is referred to as the l-list. When the lcp-value > 0, for every l-interval[i,..,j] in the list, link[i] is calculated. The interval [l,..,r] is computed by a binary search in(l-1)-list, where l is the largest left boundary amongst all the l-1 intervals. The suffix link interval of [i,..j] is represented by this interval[l,..,r]. The values l and r are ultimately stored in the first index of [i,..,j].

Notes

  1. 1 2 Abouelhoda, Kurtz & Ohlebusch 2004.
  2. I, Kärkkäinen & Kempa 2014.
  3. Gawrychowski & Kociumaka 2017.
  4. Abouelhoda, Kurtz & Ohlebusch 2002.
  5. Kurtz 1999.
  6. 1 2 Puglisi, Smyth & Turpin 2007.
  7. Fischer 2011.
  8. Mori, Yuta. "sais". Archived from the original on 9 Mar 2023. Retrieved 31 Aug 2023.
  9. Burkhardt & Kärkkäinen 2003.
  10. Kulla & Sanders 2007.
  11. Dementiev et al. 2008.
  12. Larsson, N. Jesper; Sadakane, Kunihiko (22 November 2007). "Faster suffix sorting". Theoretical Computer Science. 387 (3): 258–272. doi: 10.1016/j.tcs.2007.07.017 . ISSN   0304-3975.
  13. Fischer, Johannes; Kurpicz, Florian (5 October 2017). "Dismantling DivSufSort". Proceedings of the Prague Stringology Conference 2017. arXiv: 1710.01896 .
  14. "New saca and bwt library (libsais)". encode.su. Retrieved 2021-10-03.
  15. Grebnov, Ilya (2021-09-22), libsais , retrieved 2021-10-02
  16. Shi 1996.

Related Research Articles

<span class="mw-page-title-main">Binary search</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Merge sort</span> Divide and conquer sorting algorithm

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the relative order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

The Burrows–Wheeler transform rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity.

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

<span class="mw-page-title-main">Generalized suffix tree</span>

In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings of total length , it is a Patricia tree containing all suffixes of the strings. It is mostly used in bioinformatics.

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x0, x1, x2, ... is a second sequence of numbers y0, y1, y2, ..., the sums of prefixes of the input sequence:

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, insofar as the size of the stored or encoded data similarly depends upon the specific content of the data itself.

In computer science, a range minimum query (RMQ) solves the problem of finding the minimal value in a sub-array of an array of comparable objects. Range minimum queries have several use cases in computer science, such as the lowest common ancestor problem and the longest common prefix problem (LCP).

In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini, who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for Full-text index in Minute space.

In computer science, a compressed suffix array is a compressed data structure for pattern matching. Compressed suffix arrays are a general class of data structure that improve on the suffix array. These data structures enable quick search for an arbitrary string with a comparatively small index.

<span class="mw-page-title-main">Wavelet Tree</span>

The Wavelet Tree is a succinct data structure to store strings in compressed space. It generalizes the and operations defined on bitvectors to arbitrary alphabets.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

An ordinary fractal string is a bounded, open subset of the real number line. Such a subset can be written as an at-most-countable union of connected open intervals with associated lengths written in non-increasing order; we also refer to as a fractal string. For example, is a fractal string corresponding to the Cantor set. A fractal string is the analogue of a one-dimensional "fractal drum," and typically the set has a boundary which corresponds to a fractal such as the Cantor set. The heuristic idea of a fractal string is to study a (one-dimensional) fractal using the "space around the fractal." It turns out that the sequence of lengths of the set itself is "intrinsic," in the sense that the fractal string itself contains information about the fractal to which it corresponds.

In computer science, a generalized suffix array (GSA) is a suffix array containing all suffixes for a set of strings. Given the set of strings of total length , it is a lexicographically sorted array of all suffixes of each string in . It is primarily used in bioinformatics and string processing.

In computer science a palindrome tree, also called an EerTree, is a type of search tree, that allows for fast access to all palindromes contained in a string. They can be used to solve the longest palindromic substring, the k-factorization problem, palindromic length of a string, and finding and counting all distinct sub-palindromes. Palindrome trees do this in an online manner, that is it does not require the entire string at the start and can be added to character by character.

References