Longest common substring

Last updated

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

Contents

Examples

The strings "BADANAT" and "CANADAS" share the maximal-length substrings "ADA" and "ANA". LongestSubstring svg.svg
The strings "BADANAT" and "CANADAS" share the maximal-length substrings "ADA" and "ANA".

The picture shows two strings where the problem has multiple solutions. Although the substring occurrences always overlap, it is impossible to obtain a longer common substring by "uniting" them.

The strings "ABABC", "BABCA" and "ABCBA" have only one longest common substring, viz. "ABC" of length 3. Other common substrings are "A", "AB", "B", "BA", "BC" and "C".

  ABABC     |||    BABCA     |||     ABCBA

Problem definition

Given two strings, of length and of length , find a longest string which is substring of both and .

A generalization is the k-common substring problem. Given the set of strings , where and . Find for each , a longest string which occurs as substring of at least strings.

Algorithms

One can find the lengths and starting positions of the longest common substrings of and in time with the help of a generalized suffix tree. A faster algorithm can be achieved in the word RAM model of computation if the size of the input alphabet is in . In particular, this algorithm runs in time using space. [1] Solving the problem by dynamic programming costs . The solutions to the generalized problem take space and time with dynamic programming and take time with a generalized suffix tree.

Suffix tree

Generalized suffix tree for the strings "ABAB", "BABA" and "ABBA", numbered 0, 1 and 2. Suffix tree ABAB BABA ABBA.svg
Generalized suffix tree for the strings "ABAB", "BABA" and "ABBA", numbered 0, 1 and 2.

The longest common substrings of a set of strings can be found by building a generalized suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it. The figure on the right is the suffix tree for the strings "ABAB", "BABA" and "ABBA", padded with unique string terminators, to become "ABAB$0", "BABA$1" and "ABBA$2". The nodes representing "A", "B", "AB" and "BA" all have descendant leaves from all of the strings, numbered 0, 1 and 2.

Building the suffix tree takes time (if the size of the alphabet is constant). If the tree is traversed from the bottom up with a bit vector telling which strings are seen below each node, the k-common substring problem can be solved in time. If the suffix tree is prepared for constant time lowest common ancestor retrieval, it can be solved in time. [2]

Dynamic programming

The following pseudocode finds the set of longest common substrings between two strings with dynamic programming:

function LongestCommonSubstring(S[1..r], T[1..n])     L := array(1..r, 1..n)     z := 0     ret := {}      for i := 1..r         for j := 1..n             if S[i] = T[j]                 if i = 1 or j = 1                     L[i, j] := 1                 else                     L[i, j] := L[i − 1, j − 1] + 1                 if L[i, j] > z                     z := L[i, j]                     ret := {S[i − z + 1..i]}                 elseif L[i, j] = z                     ret := ret ∪ {S[i − z + 1..i]}             else                 L[i, j] := 0     return ret

This algorithm runs in time. The array L stores the length of the longest common substring of the prefixes S[1..i] and T[1..j] which end at positioni and j, respectively. The variable z is used to hold the length of the longest common substring found so far. The set ret is used to hold the set of strings which are of length z. The set ret can be saved efficiently by just storing the index i, which is the last character of the longest common substring (of size z) instead of S[i-z+1..i]. Thus all the longest common substrings would be, for each i in ret, S[(ret[i]-z)..(ret[i])].

The following tricks can be used to reduce the memory usage of an implementation:

See also

Related Research Articles

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

<span class="mw-page-title-main">Longest common subsequence</span> Algorithmic problem on pairs of sequences

A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

<span class="mw-page-title-main">Schönhage–Strassen algorithm</span> Multiplication algorithm

The Schönhage–Strassen algorithm is an asymptotically fast multiplication algorithm for large integers, published by Arnold Schönhage and Volker Strassen in 1971. It works by recursively applying fast Fourier transform (FFT) over the integers modulo 2n+1. The run-time bit complexity to multiply two n-digit numbers using the algorithm is in big O notation.

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y.

<span class="mw-page-title-main">Generalized suffix tree</span>

In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings of total length , it is a Patricia tree containing all suffixes of the strings. It is mostly used in bioinformatics.

<span class="mw-page-title-main">Longest repeated substring problem</span>

In computer science, the longest repeated substring problem is the problem of finding the longest substring of a string that occurs at least twice.

In computer science, a maximal pair within a string is a pair of matching substrings that are maximal, where "maximal" means that it is not possible to make a longer matching pair by extending the range of both substrings to the left or right.

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings rather than returning only one substring or returning the maximum length of a palindromic substring.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

<span class="mw-page-title-main">Suffix automaton</span> Deterministic finite automaton accepting set of all suffixes of particular string

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

Gestalt pattern matching, also Ratcliff/Obershelp pattern recognition, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in the Dr. Dobb's Journal in July 1988.

In computer science, a generalized suffix array (GSA) is a suffix array containing all suffixes for a set of strings. Given the set of strings of total length , it is a lexicographically sorted array of all suffixes of each string in . It is primarily used in bioinformatics and string processing.

In computer science a palindrome tree, also called an EerTree, is a type of search tree, that allows for fast access to all palindromes contained in a string. They can be used to solve the longest palindromic substring, the k-factorization problem, palindromic length of a string, and finding and counting all distinct sub-palindromes. Palindrome trees do this in an online manner, that is it does not require the entire string at the start and can be added to character by character.

References

  1. Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Pissis, Solon P.; Radoszewski, Jakub (Aug 2021). Mutzel, Petra; Pagh, Rasmus; Herman, Grzegorz (eds.). Faster Algorithms for Longest Common Substring. European Symposium on Algorithms. Leibniz International Proceedings in Informatics (LIPIcs). Vol. 204. Schloss Dagstuhl. doi:10.4230/LIPIcs.ESA.2021.30. Here: Theorem 1, p.30:2.
  2. Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN   0-521-58519-8.