LCP array


In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.


For example, if A := [aab, ab, abaab, b, baab] is a suffix array, the longest common prefix between A[1] = aab and A[2] = ab is a, which has length 1, so H[2] = 1 in the LCP array H. Likewise, the LCP of A[2] = ab and A[3] = abaab is ab, so H[3] = 2.

Augmenting the suffix array with the LCP array allows one to efficiently simulate top-down and bottom-up traversals of the suffix tree, [1] [2] speeds up pattern matching on the suffix array, [3] and is a prerequisite for compressed suffix trees. [4]

History

The LCP array was introduced in 1993 by Udi Manber and Gene Myers alongside the suffix array in order to improve the running time of their string search algorithm. [3]

Definition

Let A be the suffix array of the string S = s_1 s_2 ... s_n of length n, where s_n = $ is a sentinel letter that is unique and lexicographically smaller than any other character. Let S[i, j] denote the substring of S ranging from position i to position j. Thus, S[A[i], n] is the i-th smallest suffix of S.

Let lcp(v, w) denote the length of the longest common prefix between two strings v and w. Then the LCP array H[1, n] is an integer array of size n such that H[1] is undefined and H[i] = lcp(S[A[i-1], n], S[A[i], n]) for every 1 < i ≤ n. Thus H[i] stores the length of the longest common prefix of the lexicographically i-th smallest suffix and its predecessor in the suffix array.
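Stated in code, the definition looks as follows (a small Python sketch; note that Python lists are 0-indexed while the definition above is 1-based, so the undefined first entry is represented as None):

```python
def lcp(v, w):
    """Length of the longest common prefix of strings v and w."""
    n = 0
    while n < min(len(v), len(w)) and v[n] == w[n]:
        n += 1
    return n

def lcp_array(sorted_suffixes):
    """LCP array H for a list of lexicographically sorted suffixes.
    H[0] is undefined (None); H[i] is the LCP of suffix i and its
    predecessor in the sorted order."""
    H = [None]
    for i in range(1, len(sorted_suffixes)):
        H.append(lcp(sorted_suffixes[i - 1], sorted_suffixes[i]))
    return H

# The example from the introduction:
A = ["aab", "ab", "abaab", "b", "baab"]
print(lcp_array(A))  # [None, 1, 2, 0, 1]
```

This quadratic construction only illustrates the definition; efficient algorithms are discussed below.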

Difference between LCP array and suffix array:

- Suffix array: represents the lexicographic rank of each suffix of a string.
- LCP array: contains the maximum-length prefix match between two consecutive suffixes, after they are sorted lexicographically.

Example

Consider the string S = banana$:

i      1  2  3  4  5  6  7
S[i]   b  a  n  a  n  a  $

and its corresponding sorted suffix array A:

i      1  2  3  4  5  6  7
A[i]   7  6  4  2  1  5  3

Suffix array with the suffixes written out:

i           1  2   3     4       5        6    7
A[i]        7  6   4     2       1        5    3
S[A[i], n]  $  a$  ana$  anana$  banana$  na$  nana$

Then the LCP array H is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:

i      1          2  3  4  5  6  7
H[i]   undefined  0  1  3  0  0  2

So, for example, H[4] = 3 is the length of the longest common prefix ana shared by the suffixes S[A[3], n] = ana$ and S[A[4], n] = anana$. Note that H[1] is undefined, since there is no lexicographically smaller suffix.

Efficient construction algorithms

LCP array construction algorithms can be divided into two categories: algorithms that compute the LCP array as a byproduct of the suffix array construction, and algorithms that use an already constructed suffix array in order to compute the LCP values.

Manber & Myers (1993) provide an algorithm to compute the LCP array alongside the suffix array in O(n log n) time. Kärkkäinen & Sanders (2003) show that it is also possible to modify their O(n) time suffix array construction algorithm such that it computes the LCP array as well. Kasai et al. (2001) present the first O(n) time algorithm (FLAAP) that computes the LCP array given the text and the suffix array.
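The algorithm of Kasai et al. can be sketched as follows (a 0-indexed Python sketch; rank is the inverse suffix array, and the key observation is that when moving from the suffix starting at text position j to the one starting at j + 1, the LCP value can decrease by at most one):

```python
def kasai(s, sa):
    """Compute the LCP array from string s and its suffix array sa in
    O(n) time (Kasai et al., 2001). 0-indexed: lcp[i] is the LCP of the
    suffixes at sa[i-1] and sa[i]; lcp[0] is left as 0."""
    n = len(s)
    rank = [0] * n                 # rank[j] = position of suffix j in sa
    for i, suf in enumerate(sa):
        rank[suf] = i
    lcp = [0] * n
    h = 0
    for j in range(n):             # process suffixes in text order
        if rank[j] > 0:
            k = sa[rank[j] - 1]    # suffix preceding j in sorted order
            while j + h < n and k + h < n and s[j + h] == s[k + h]:
                h += 1
            lcp[rank[j]] = h
            if h > 0:
                h -= 1             # LCP drops by at most 1 per step
        else:
            h = 0                  # lexicographically smallest suffix
    return lcp

sa = [6, 5, 3, 1, 0, 4, 2]         # suffix array of "banana$" (0-indexed)
print(kasai("banana$", sa))        # [0, 0, 1, 3, 0, 0, 2]
```

The amortization in the inner while loop is what makes the overall running time linear: h is decremented at most n times, so it can be incremented at most 2n times in total.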

Assuming that each text symbol takes one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their algorithm is a large space occupancy of 13n bytes, while the original output (text, suffix array, LCP array) only occupies 9n bytes. Therefore, Manzini (2004) created a refined version of the algorithm of Kasai et al. (2001) (lcp9) and reduced the space occupancy to 9n bytes. Kärkkäinen, Manzini & Puglisi (2009) provide another refinement of Kasai's algorithm (the Φ-algorithm) that improves the running time. Rather than the actual LCP array, this algorithm builds the permuted LCP (PLCP) array, in which the values appear in text order rather than in lexicographical order.

Gog & Ohlebusch (2011) provide two algorithms that, although being theoretically slow (O(n^2)), were faster than the above-mentioned algorithms in practice.

As of 2012, the fastest linear-time LCP array construction algorithm is due to Fischer (2011), which in turn is based on one of the fastest suffix array construction algorithms (SA-IS) by Nong, Zhang & Chan (2009). The algorithm of Fischer & Kurpicz (2017), based on Yuta Mori's DivSufSort, is even faster.

Applications

As noted by Abouelhoda, Kurtz & Ohlebusch (2004), several string processing problems can be solved by the following kinds of tree traversals:

- bottom-up traversal of the complete suffix tree
- top-down traversal of a subtree of the suffix tree
- suffix tree traversal using the suffix links

Kasai et al. (2001) show how to simulate a bottom-up traversal of the suffix tree using only the suffix array and LCP array. Abouelhoda, Kurtz & Ohlebusch (2004) enhance the suffix array with the LCP array and additional data structures and describe how this enhanced suffix array can be used to simulate all three kinds of suffix tree traversals. Fischer & Heun (2007) reduce the space requirements of the enhanced suffix array by preprocessing the LCP array for range minimum queries. Thus, every problem that can be solved by suffix tree algorithms can also be solved using the enhanced suffix array. [2]

Deciding if a pattern P of length m is a substring of a string S of length n takes O(m log n) time if only the suffix array is used. By additionally using the LCP information, this bound can be improved to O(m + log n) time. [3] Abouelhoda, Kurtz & Ohlebusch (2004) show how to improve this running time even further to achieve optimal O(m) time. Thus, using suffix array and LCP array information, the decision query can be answered as fast as using the suffix tree.

The LCP array is also an essential part of compressed suffix trees which provide full suffix tree functionality like suffix links and lowest common ancestor queries. [5] [6] Furthermore, it can be used together with the suffix array to compute the Lempel-Ziv LZ77 factorization in time. [2] [7] [8] [9]

The longest repeated substring problem for a string S of length n can be solved in Θ(n) time using both the suffix array A and the LCP array H. It is sufficient to perform a linear scan through the LCP array in order to find its maximum value v_max and the corresponding index i where it is stored. The longest substring that occurs at least twice is then given by S[A[i], A[i] + v_max - 1].
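This linear scan can be written as a minimal Python sketch (0-indexed, with the undefined first LCP entry represented as 0):

```python
def longest_repeated_substring(s, sa, H):
    """Longest substring of s that occurs at least twice, found by a
    linear scan for the maximum of the LCP array H (0-indexed, H[0] = 0).
    sa is the suffix array of s."""
    i = max(range(len(H)), key=lambda k: H[k])   # index of maximum LCP value
    return s[sa[i]:sa[i] + H[i]]                 # the repeated prefix itself

# "banana$": suffix array and LCP array as in the example above (0-indexed)
sa = [6, 5, 3, 1, 0, 4, 2]
H  = [0, 0, 1, 3, 0, 0, 2]
print(longest_repeated_substring("banana$", sa, H))  # ana
```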

The remainder of this section explains two applications of the LCP array in more detail: How the suffix array and the LCP array of a string can be used to construct the corresponding suffix tree and how it is possible to answer LCP queries for arbitrary suffixes using range minimum queries on the LCP array.

Find the number of occurrences of a pattern

In order to find the number of occurrences of a given string P (length m) in a text T (length N), [3]

- use two binary searches on the suffix array of T to find the start and the end of the range of suffixes that have P as a prefix;
- the length of this range is the number of occurrences of P.

The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons that need to be made, we compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m log N).

The LCP-LR array helps improve this to O(m + log N), in the following way:

At any point during the binary search algorithm, we consider, as usual, a range (L, ..., R) of the suffix array and its central point M, and decide whether we continue our search in the left sub-range (L, ..., M) or in the right sub-range (M, ..., R). In order to make the decision, we compare P to the string at M. If P is identical to M, our search is complete. But if not, we have already compared the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M. So, in the next step, we consider (M, ..., R) and a new central point M' in the middle:

             M ...... M' ...... R
             |
             we know: lcp(P, M) == k

The trick now is that LCP-LR is precomputed such that an O(1)-lookup tells us the longest common prefix of M and M', lcp(M, M').

We already know (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P, M) = k. Now there are three possibilities:

- Case 1, k < lcp(M, M'): The first k characters of M' are the same as those of M. Since P differs from M at position k + 1 and M' agrees with M there, P compares to M' exactly as it compared to M, so P > M'. We continue in the right sub-range (M', ..., R) without comparing any characters.
- Case 2, k > lcp(M, M'): M and M' already differ at position lcp(M, M') + 1, where M' has the larger character (since M < M'). P agrees with M beyond that position, so P < M'. We continue in the left sub-range (M, ..., M'), again without comparing any characters.
- Case 3, k = lcp(M, M'): P agrees with M' on its first k characters. We resume comparing P and M' starting from character k + 1, which yields a new, at least as large, value of k.

The overall effect is that no character of P is compared to any character of the text more than once (for details see [3]). The total number of character comparisons is bounded by O(m), so the total complexity is indeed O(m + log N).

We still need to precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array. We know the standard LCP array gives us the lcp of consecutive entries only, i.e. lcp(A[i-1], A[i]) for any i. However, M and M' in the description above are not necessarily consecutive entries.

The key to this is to realize that only certain ranges (L, ..., R) will ever occur during the binary search: it always starts with (1, ..., N) and divides that at the center, and then continues either left or right and divides that half again, and so forth. Another way of looking at it is: every entry of the suffix array occurs as the central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L, ..., M, ..., R) that can possibly play a role during binary search, and it suffices to precompute lcp(L, M) and lcp(M, R) for those N possible ranges. So that is 2N distinct precomputed values, hence LCP-LR is O(N) in size.

Moreover, there is a straightforward recursive algorithm to compute the values of LCP-LR in O(N) time from the standard LCP array.
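One possible form of this recursion is the following Python sketch (an illustration, not taken from the cited sources; it stores the 2N values in a dictionary M keyed by range endpoints, exploiting the fact that lcp(A[l], A[r]) equals the minimum of H[l+1..r]):

```python
def build_lcp_lr(H):
    """Precompute LCP-LR from the (0-indexed) LCP array H, with H[0] = 0.
    For every range (l, r) that the binary search can visit, M[(l, r)]
    stores the lcp of the suffixes A[l] and A[r], i.e. min(H[l+1..r])."""
    M = {}
    def rec(l, r):
        if l + 1 == r:              # adjacent entries: plain LCP value
            return H[r]
        m = (l + r) // 2            # central point of the range (l, r)
        M[(l, m)] = rec(l, m)       # lcp(A[l], A[m])
        M[(m, r)] = rec(m, r)       # lcp(A[m], A[r])
        return min(M[(l, m)], M[(m, r)])
    rec(0, len(H) - 1)
    return M

# LCP array of "banana$" (0-indexed): lcp(A[1], A[3]) = lcp("a$", "ana$") = 1
M = build_lcp_lr([0, 0, 1, 3, 0, 0, 2])
print(M[(1, 3)])  # 1
```

Each range is visited once and contributes O(1) work beyond its recursive calls, giving the stated O(N) construction time.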

To sum up:

- Using LCP-LR, binary search on the suffix array answers substring queries in O(m + log N) time instead of the O(m log N) of plain binary search.
- LCP-LR occupies O(N) additional space and can be computed in O(N) time from the standard LCP array.

Suffix tree construction

Given the suffix array A and the LCP array H of a string S of length n, its suffix tree can be constructed in O(n) time based on the following idea: start with the partial suffix tree for the lexicographically smallest suffix and repeatedly insert the other suffixes in the order given by the suffix array.

Let ST_i be the partial suffix tree for the suffixes S[A[1], n], ..., S[A[i], n], where 0 ≤ i ≤ n. Further let d(v) be the length of the concatenation of all path labels from the root of ST_i to node v.

Case 1 (d(v) = H[i+1]): Suppose the suffixes a$, ana$, anana$ and banana$ of the string S = banana$ are already added to the suffix tree. Then the suffix na$ is added to the tree as shown in the picture. The rightmost path is highlighted in red.

Start with ST_0, the tree consisting only of the root. To insert S[A[i+1], n] into ST_i, walk up the rightmost path beginning at the recently inserted leaf A[i] to the root, until the deepest node v with d(v) ≤ H[i+1] is reached.

We need to distinguish two cases:

Case 1 (d(v) = H[i+1]): The concatenation of the labels of the root-to-v path equals the longest common prefix of S[A[i], n] and S[A[i+1], n]. In this case, add A[i+1] as a new leaf x and connect it to v by an edge labeled S[A[i+1] + H[i+1], n]. The edge label consists of the characters of the inserted suffix that are not already represented by the root-to-v path. This creates the partial suffix tree ST_{i+1}.

Case 2 (d(v) < H[i+1]): The longest common prefix of S[A[i], n] and S[A[i+1], n] ends within the edge between v and its child w on the rightmost path, so that edge has to be split:

  1. Delete the edge (v, w).
  2. Add a new internal node y and a new edge (v, y) with label S[A[i] + d(v), A[i] + H[i+1] - 1]. The new label consists of the missing characters of the longest common prefix of S[A[i], n] and S[A[i+1], n]. Thus, the concatenation of the labels of the root-to-y path now displays the longest common prefix of S[A[i], n] and S[A[i+1], n].
  3. Connect w to the newly created internal node y by an edge that is labeled S[A[i] + H[i+1], A[i] + d(w) - 1]. The new label consists of the remaining characters of the deleted edge (v, w) that were not used as the label of edge (v, y).
  4. Add A[i+1] as a new leaf x and connect it to y by an edge that is labeled S[A[i+1] + H[i+1], n]. Thus the edge label consists of the remaining characters of suffix S[A[i+1], n] that are not already represented by the concatenation of the labels of the root-to-y path.
  5. This creates the partial suffix tree ST_{i+1}.

A simple amortization argument shows that the running time of this algorithm is bounded by O(n):

The nodes that are traversed in step i + 1 by walking up the rightmost path of ST_i (apart from the last node v) are removed from the rightmost path when A[i+1] is added to the tree as a new leaf. These nodes will never be traversed again in any subsequent step. Therefore, at most 2n nodes will be traversed in total.

LCP queries for arbitrary suffixes

The LCP array H only contains the length of the longest common prefix of every pair of consecutive suffixes in the suffix array A. However, with the help of the inverse suffix array A^{-1} (i.e., the suffix S[j, n] that starts at position j in S is stored in position A^{-1}[j] in A) and constant-time range minimum queries on H, it is possible to determine the length of the longest common prefix of arbitrary suffixes in O(1) time.

Because of the lexicographic order of the suffix array, every common prefix of the suffixes S[i, n] and S[j, n] has to be a common prefix of all suffixes between i's position in the suffix array, A^{-1}[i], and j's position, A^{-1}[j]. Therefore, the length of the longest prefix that is shared by all of these suffixes is the minimum value in the interval H[A^{-1}[i] + 1, A^{-1}[j]]. This value can be found in constant time if H is preprocessed for range minimum queries.

Thus, given a string S of length n and two arbitrary positions i, j in the string S with A^{-1}[i] < A^{-1}[j], the length of the longest common prefix of the suffixes S[i, n] and S[j, n] can be computed as follows: lcp(S[i, n], S[j, n]) = H[RMQ_H(A^{-1}[i] + 1, A^{-1}[j])], where RMQ_H(l, r) returns the index of a minimum value in H[l, r].
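As an illustration, a naive Python sketch of this query (0-indexed; the range minimum is taken with min() here, which costs O(r - l) per query, so in practice H would be preprocessed for constant-time range minimum queries instead):

```python
def lcp_any(i, j, sa, H):
    """LCP of the suffixes of s starting at text positions i and j,
    computed via the inverse suffix array and a range minimum over the
    LCP array H (0-indexed, H[0] = 0)."""
    inv = [0] * len(sa)            # inverse suffix array: inv[sa[r]] = r
    for r, suf in enumerate(sa):
        inv[suf] = r
    if i == j:
        return len(sa) - i         # a suffix shares its whole length with itself
    a, b = sorted((inv[i], inv[j]))
    return min(H[a + 1:b + 1])     # minimum of H over the ranks between them

sa = [6, 5, 3, 1, 0, 4, 2]         # suffix array of "banana$" (0-indexed)
H  = [0, 0, 1, 3, 0, 0, 2]         # its LCP array
print(lcp_any(1, 3, sa, H))        # 3: "anana$" and "ana$" share "ana"
```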

Notes


References