Longest common subsequence

[Figure: Comparison of two revisions of an example file, based on their longest common subsequence (black).]

A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.


For example, consider the sequences (ABCD) and (ACBAD). They have five length-2 common subsequences: (AB), (AC), (AD), (BD), and (CD); two length-3 common subsequences: (ABD) and (ACD); and no longer common subsequences. So (ABD) and (ACD) are their longest common subsequences.
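
For small inputs, claims like these can be checked by brute force. The following Python sketch (illustrative only; it enumerates all subsequences and is therefore exponential in the input lengths) reproduces the example above:

    from itertools import combinations

    def common_subsequences(a, b):
        """All distinct common subsequences of two short strings, by brute force."""
        subs = lambda s: {"".join(c) for r in range(len(s) + 1)
                          for c in combinations(s, r)}
        return subs(a) & subs(b)

    common = common_subsequences("ABCD", "ACBAD")
    print(sorted(s for s in common if len(s) == 2))  # ['AB', 'AC', 'AD', 'BD', 'CD']
    print(sorted(s for s in common if len(s) == 3))  # ['ABD', 'ACD']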

Complexity

For the general case of an arbitrary number of input sequences, the problem is NP-hard. [1] When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming.

Given N sequences of lengths n1, ..., nN, a naive search would test each of the 2^n1 subsequences of the first sequence to determine whether they are also subsequences of the remaining sequences; each subsequence may be tested in time linear in the lengths of the remaining sequences, so the time for this algorithm would be

O(2^n1 · (n2 + ... + nN)).

For the case of two sequences of n and m elements, the running time of the dynamic programming approach is O(n × m). [2] For an arbitrary number of input sequences, the dynamic programming approach gives a solution in

O(N · n1 · n2 · ... · nN) time.

There exist methods with lower complexity, [3] which often depend on the length of the LCS, the size of the alphabet, or both.

The LCS is not necessarily unique; in the worst case, the number of common subsequences is exponential in the lengths of the inputs, so any algorithm that outputs all of them must take at least exponential time. [4]

Solution for two sequences

The LCS problem has an optimal substructure: the problem can be broken down into smaller, simpler subproblems, which can, in turn, be broken down into simpler subproblems, and so on, until, finally, the solution becomes trivial. LCS in particular has overlapping subproblems: the solutions to high-level subproblems often reuse solutions to lower level subproblems. Problems with these two properties are amenable to dynamic programming approaches, in which subproblem solutions are memoized, that is, the solutions of subproblems are saved for reuse.

Prefixes

The prefix Sn of S is defined as the first n characters of S. [5] For example, the prefixes of S = (AGCA) are

S0 = ()
S1 = (A)
S2 = (AG)
S3 = (AGC)
S4 = (AGCA).

Let LCS(X, Y) be a function that computes a longest subsequence common to X and Y. Such a function has two interesting properties.

First property

LCS(X^A, Y^A) = LCS(X, Y)^A, for all strings X, Y and all symbols A, where ^ denotes string concatenation. This allows one to simplify the LCS computation for two sequences ending in the same symbol. For example, LCS("BANANA","ATANA") = LCS("BANAN","ATAN")^"A". Continuing for the remaining common symbols, LCS("BANANA","ATANA") = LCS("BAN","AT")^"ANA".
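
Applied repeatedly, this property lets an implementation peel off a shared suffix before doing any heavier work (a shared prefix can be handled symmetrically). A minimal Python sketch of the suffix case; the function name is illustrative:

    def strip_common_suffix(x, y):
        """Split off the longest shared suffix s, so that LCS(x, y) = LCS(x', y') ^ s."""
        k = 0
        while k < len(x) and k < len(y) and x[len(x)-1-k] == y[len(y)-1-k]:
            k += 1
        return x[:len(x)-k], y[:len(y)-k], x[len(x)-k:]

    print(strip_common_suffix("BANANA", "ATANA"))  # ('BAN', 'AT', 'ANA')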

Second property

If A and B are distinct symbols (A ≠ B), then LCS(X^A, Y^B) is one of the maximal-length strings in the set { LCS(X^A, Y), LCS(X, Y^B) }, for all strings X, Y.

For example, LCS("ABCDEFG","BCDGK") is the longest string among LCS("ABCDEFG","BCDG") and LCS("ABCDEF","BCDGK"); if both happened to be of equal length, one of them could be chosen arbitrarily.

To realize the property, distinguish two cases:

If LCS(X^A, Y^B) ends with the symbol A, then that final A cannot be matched to the final B of Y^B, so LCS(X^A, Y^B) is also a common subsequence of X^A and Y.
If LCS(X^A, Y^B) does not end with A, then it does not use the final symbol of X^A, so it is also a common subsequence of X and Y^B.

In either case LCS(X^A, Y^B) is an element of the set above, and since both elements of that set are themselves common subsequences of X^A and Y^B, it must be one of maximal length.

LCS function defined

Let two sequences be defined as follows: X = (x1 x2 ... xm) and Y = (y1 y2 ... yn). The prefixes of X are X0, X1, ..., Xm; the prefixes of Y are Y0, Y1, ..., Yn. Let LCS(Xi, Yj) represent the set of longest common subsequences of prefixes Xi and Yj. This set of sequences is given by the following:

LCS(Xi, Yj) =
    { ε }                                   if i = 0 or j = 0
    LCS(Xi-1, Yj-1) ^ xi                    if i, j > 0 and xi = yj
    max{ LCS(Xi, Yj-1), LCS(Xi-1, Yj) }     if i, j > 0 and xi ≠ yj

Here ^ appends xi to every sequence in the set, and max keeps the set whose sequences are longer (or the union of both sets, when the lengths are equal).

To find the LCS of Xi and Yj, compare xi and yj. If they are equal, then the sequence LCS(Xi-1, Yj-1) is extended by that element, xi. If they are not equal, then the longer of the two sequences, LCS(Xi, Yj-1) and LCS(Xi-1, Yj), is retained. (If they are the same length, but not identical, then both are retained.) The base case, when either Xi or Yj is empty, is the empty string, ε.
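
This recurrence can be transliterated almost directly into Python. The sketch below (the name lcs_set is illustrative) returns the full set of longest common subsequences; since that set can be exponentially large, it is only practical for short inputs:

    from functools import lru_cache

    def lcs_set(x, y):
        """Set of all longest common subsequences of x and y, per the recurrence above."""
        @lru_cache(maxsize=None)
        def rec(i, j):
            if i == 0 or j == 0:
                return frozenset({""})                  # base case: an empty prefix
            if x[i-1] == y[j-1]:                        # last elements match: extend
                return frozenset(z + x[i-1] for z in rec(i-1, j-1))
            left, up = rec(i, j-1), rec(i-1, j)         # keep the longer set, or both
            llen, ulen = max(map(len, left)), max(map(len, up))
            if llen != ulen:
                return left if llen > ulen else up
            return left | up
        return set(rec(len(x), len(y)))

    print(sorted(lcs_set("GAC", "AGCAT")))  # ['AC', 'GA', 'GC'] (cf. the worked example below)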

Worked example

The longest subsequence common to R = (GAC) and C = (AGCAT) will be found. Because the LCS function uses a "zeroth" element, it is convenient to define zero prefixes that are empty for these sequences: R0 = ε and C0 = ε. All the prefixes are placed in a table with C in the first row (making it a column header) and R in the first column (making it a row header).

LCS Strings
       ε   A   G   C   A   T
  ε    ε   ε   ε   ε   ε   ε
  G    ε
  A    ε
  C    ε

This table is used to store the LCS sequence for each step of the calculation. The second column and second row have been filled in with ε, because when an empty sequence is compared with a non-empty sequence, the longest common subsequence is always an empty sequence.

LCS(R1, C1) is determined by comparing the first elements in each sequence. G and A are not the same, so this LCS gets (using the "second property") the longer of the two sequences, LCS(R1, C0) and LCS(R0, C1). According to the table, both of these are empty, so LCS(R1, C1) is also empty, as shown in the table below. The arrows indicate that the sequence comes from both the cell above, LCS(R0, C1), and the cell on the left, LCS(R1, C0).

LCS(R1, C2) is determined by comparing G and G. They match, so G is appended to the upper left sequence, LCS(R0, C1), which is (ε), giving (εG), which is (G).

For LCS(R1, C3), G and C do not match. The sequence above is empty; the one to the left contains one element, G. Selecting the longer of these, LCS(R1, C3) is (G). The arrow points to the left, since that is the longer of the two sequences.

LCS(R1, C4), likewise, is (G).

LCS(R1, C5), likewise, is (G).

"G" Row Completed
       ε   A    G    C    A    T
  ε    ε   ε    ε    ε    ε    ε
  G    ε   ε   (G)  (G)  (G)  (G)
  A    ε
  C    ε

For LCS(R2, C1), A is compared with A. The two elements match, so A is appended to ε, giving (A).

For LCS(R2, C2), A and G do not match, so the longest of LCS(R1, C2), which is (G), and LCS(R2, C1), which is (A), is used. In this case, they each contain one element, so this LCS is given two subsequences: (A) and (G).

For LCS(R2, C3), A does not match C. LCS(R2, C2) contains sequences (A) and (G); LCS(R1, C3) is (G), which is already contained in LCS(R2, C2). The result is that LCS(R2, C3) also contains the two subsequences, (A) and (G).

For LCS(R2, C4), A matches A, which is appended to the upper left cell, giving (GA).

For LCS(R2, C5), A does not match T. Comparing the two sequences, (GA) and (G), the longer is (GA), so LCS(R2, C5) is (GA).

"G" & "A" Rows Completed
       ε    A         G          C         A     T
  ε    ε    ε         ε          ε         ε     ε
  G    ε    ε        (G)        (G)       (G)   (G)
  A    ε   (A)   (A) & (G)  (A) & (G)    (GA)  (GA)
  C    ε

For LCS(R3, C1), C and A do not match, so LCS(R3, C1) gets the longer of the two sequences, (A).

For LCS(R3, C2), C and G do not match. Both LCS(R3, C1) and LCS(R2, C2) have one element. The result is that LCS(R3, C2) contains the two subsequences, (A) and (G).

For LCS(R3, C3), C and C match, so C is appended to LCS(R2, C2), which contains the two subsequences, (A) and (G), giving (AC) and (GC).

For LCS(R3, C4), C and A do not match. Combining LCS(R3, C3), which contains (AC) and (GC), and LCS(R2, C4), which contains (GA), gives a total of three sequences: (AC), (GC), and (GA).

Finally, for LCS(R3, C5), C and T do not match. The result is that LCS(R3, C5) also contains the three sequences, (AC), (GC), and (GA).

Completed LCS Table
       ε    A         G           C                 A                    T
  ε    ε    ε         ε           ε                 ε                    ε
  G    ε    ε        (G)         (G)               (G)                  (G)
  A    ε   (A)   (A) & (G)   (A) & (G)            (GA)                 (GA)
  C    ε   (A)   (A) & (G)  (AC) & (GC)   (AC) & (GC) & (GA)   (AC) & (GC) & (GA)

The final result is that the last cell contains all the longest subsequences common to (AGCAT) and (GAC); these are (AC), (GC), and (GA). The table also shows the longest common subsequences for every possible pair of prefixes. For example, for (AGC) and (GA), the longest common subsequences are (A) and (G).

Traceback approach

Calculating the LCS of a row of the LCS table requires only the solutions to the current row and the previous row. Still, for long sequences, these sequences can get numerous and long, requiring a lot of storage space. Storage space can be saved by saving not the actual subsequences, but the length of the subsequence and the direction of the arrows, as in the table below.

Storing length, rather than sequences
       ε   A   G   C   A   T
  ε    0   0   0   0   0   0
  G    0   0   1   1   1   1
  A    0   1   1   1   2   2
  C    0   1   1   2   2   2

The actual subsequences are deduced in a "traceback" procedure that follows the arrows backwards, starting from the last cell in the table. When the length decreases, the sequences must have had a common element. Several paths are possible when two arrows are shown in a cell. Below is the table for such an analysis; the starred numbers mark the cells on one traceback path, whose matched symbols trace out the sequence (GA). [6]

Traceback example
       ε    A    G    C    A    T
  ε    0   *0    0    0    0    0
  G    0    0   *1   *1    1    1
  A    0    1    1    1   *2   *2
  C    0    1    1    2    2   *2

Relation to other problems

For two strings X[1..m] and Y[1..n], the length of the shortest common supersequence is related to the length of the LCS by [3]

|SCS(X, Y)| = m + n − |LCS(X, Y)|.

The edit distance when only insertion and deletion are allowed (no substitution), or when the cost of a substitution is double the cost of an insertion or deletion, is:

d(X, Y) = m + n − 2 |LCS(X, Y)|.
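
Both identities are easy to check numerically. A small Python sketch, using a table-based helper for the LCS length:

    def lcs_length(x, y):
        """Length of an LCS via the standard dynamic programming table."""
        m, n = len(x), len(y)
        c = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                c[i][j] = c[i-1][j-1] + 1 if x[i-1] == y[j-1] else max(c[i][j-1], c[i-1][j])
        return c[m][n]

    x, y = "BANANA", "ATANA"
    k = lcs_length(x, y)            # 4 (e.g. "AANA")
    print(len(x) + len(y) - k)      # 7: length of a shortest common supersequence
    print(len(x) + len(y) - 2 * k)  # 3: edit distance with insertions and deletions only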

Code for the dynamic programming solution

Computing the length of the LCS

The function below takes as input sequences X[1..m] and Y[1..n], computes the length of the LCS between X[1..i] and Y[1..j] for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, and stores it in C[i,j]. C[m,n] will contain the length of the LCS of X and Y. [7]

function LCSLength(X[1..m], Y[1..n])
    C = array(0..m, 0..n)
    for i := 0..m
        C[i,0] = 0
    for j := 0..n
        C[0,j] = 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j]
                C[i,j] := C[i-1,j-1] + 1
            else
                C[i,j] := max(C[i,j-1], C[i-1,j])
    return C[m,n]

Alternatively, memoization could be used.
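
A memoized top-down version in Python might look as follows (a sketch equivalent to LCSLength above):

    from functools import lru_cache

    def lcs_length_memo(x, y):
        """Top-down LCS length: memoize the recurrence instead of filling the table."""
        @lru_cache(maxsize=None)
        def rec(i, j):
            if i == 0 or j == 0:
                return 0
            if x[i-1] == y[j-1]:
                return rec(i-1, j-1) + 1
            return max(rec(i, j-1), rec(i-1, j))
        return rec(len(x), len(y))

    print(lcs_length_memo("XMJYAUZ", "MZJAWXU"))  # 4 (the LCS is "MJAU"; see the example below)

For long sequences the recursion depth (up to m + n) can exceed Python's default limit, which is one reason the bottom-up form is usually preferred.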

Reading out an LCS

The following function backtracks the choices taken when computing the C table. If the last characters in the prefixes are equal, they must be in an LCS. If not, check which gave the larger LCS: keeping X[i] (that is, C[i, j-1]) or keeping Y[j] (that is, C[i-1, j]), and make the same choice. Just choose one if they were equally long. Call the function with i = m and j = n.

function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return ""
    if X[i] = Y[j]
        return backtrack(C, X, Y, i-1, j-1) + X[i]
    if C[i,j-1] > C[i-1,j]
        return backtrack(C, X, Y, i, j-1)
    return backtrack(C, X, Y, i-1, j)

Reading out all LCSs

If keeping X[i] and keeping Y[j] would give equally long results, read out both resulting subsequences. This is returned as a set by this function. Notice that this function is not polynomial, as it might branch in almost every step if the strings are similar.

function backtrackAll(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return {""}
    if X[i] = Y[j]
        return {Z + X[i] for all Z in backtrackAll(C, X, Y, i-1, j-1)}
    R := {}
    if C[i,j-1] ≥ C[i-1,j]
        R := backtrackAll(C, X, Y, i, j-1)
    if C[i-1,j] ≥ C[i,j-1]
        R := R ∪ backtrackAll(C, X, Y, i-1, j)
    return R

This function will backtrack through the C matrix, and print the diff between the two sequences. Notice that you will get a different answer if you exchange ≥ and <, with > and ≤, below.

function printDiff(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i > 0 and j > 0 and X[i] = Y[j]
        printDiff(C, X, Y, i-1, j-1)
        print "  " + X[i]
    else if j > 0 and (i = 0 or C[i,j-1] ≥ C[i-1,j])
        printDiff(C, X, Y, i, j-1)
        print "+ " + Y[j]
    else if i > 0 and (j = 0 or C[i,j-1] < C[i-1,j])
        printDiff(C, X, Y, i-1, j)
        print "- " + X[i]
    else
        print ""
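
A Python transliteration (0-based string indexing, so X[i] becomes x[i-1]), together with its output for the sequences used in the example below:

    def lcs_table(x, y):
        """Fill the C table of LCS lengths, as in LCSLength."""
        m, n = len(x), len(y)
        c = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                c[i][j] = c[i-1][j-1] + 1 if x[i-1] == y[j-1] else max(c[i][j-1], c[i-1][j])
        return c

    def print_diff(c, x, y, i, j):
        """Walk the table backwards, printing kept ("  "), inserted ("+"), and deleted ("-") items."""
        if i > 0 and j > 0 and x[i-1] == y[j-1]:
            print_diff(c, x, y, i-1, j-1)
            print("  " + x[i-1])
        elif j > 0 and (i == 0 or c[i][j-1] >= c[i-1][j]):
            print_diff(c, x, y, i, j-1)
            print("+ " + y[j-1])
        elif i > 0:
            print_diff(c, x, y, i-1, j)
            print("- " + x[i-1])

    x, y = "XMJYAUZ", "MZJAWXU"
    print_diff(lcs_table(x, y), x, y, len(x), len(y))
    # Output:
    # - X
    #   M
    # + Z
    #   J
    # - Y
    #   A
    # + W
    # + X
    #   U
    # - Z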

Example

Let X be “XMJYAUZ” and Y be “MZJAWXU”. The longest common subsequence between X and Y is “MJAU”. The table C shown below, which is generated by the function LCSLength, shows the lengths of the longest common subsequences between prefixes of X and Y. The ith row and jth column shows the length of the LCS between X[1..i] and Y[1..j].

          0   1   2   3   4   5   6   7
          ε   M   Z   J   A   W   X   U
  0   ε   0   0   0   0   0   0   0   0
  1   X  *0   0   0   0   0   0   1   1
  2   M   0  *1  *1   1   1   1   1   1
  3   J   0   1   1  *2   2   2   2   2
  4   Y   0   1   1  *2   2   2   2   2
  5   A   0   1   1   2  *3  *3  *3   3
  6   U   0   1   1   2   3   3   3  *4
  7   Z   0   1   2   2   3   3   3  *4

The starred numbers show the path the function backtrack would follow from the bottom right to the top left corner, when reading out an LCS. If the current symbols in X and Y are equal, they are part of the LCS, and we go both up and left (along the diagonal). If not, we go up or left, depending on which cell has the higher number. This corresponds to either taking the LCS between X[1..i-1] and Y[1..j], or between X[1..i] and Y[1..j-1].

Code optimization

Several optimizations can be made to the algorithm above to speed it up for real-world cases.

Reduce the problem set

The C matrix in the naive algorithm grows quadratically with the lengths of the sequences. For two 100-item sequences, a 10,000-item matrix would be needed, and 10,000 comparisons would need to be done. In most real-world cases, especially source code diffs and patches, the beginnings and ends of files rarely change, and almost certainly not both at the same time. If only a few items have changed in the middle of the sequence, the beginning and end can be eliminated. This reduces not only the memory requirements for the matrix, but also the number of comparisons that must be done.

function LCS(X[1..m], Y[1..n])
    start := 1
    m_end := m
    n_end := n
    // trim off the matching items at the beginning
    while start ≤ m_end and start ≤ n_end and X[start] = Y[start]
        start := start + 1
    // trim off the matching items at the end
    while start ≤ m_end and start ≤ n_end and X[m_end] = Y[n_end]
        m_end := m_end - 1
        n_end := n_end - 1
    C = array(start-1..m_end, start-1..n_end)
    // only loop over the items that have changed
    for i := start..m_end
        for j := start..n_end
            // the algorithm continues as before ...

In the best-case scenario, a sequence with no changes, this optimization would eliminate the need for the C matrix. In the worst-case scenario, a change to the very first and last items in the sequence, only two additional comparisons are performed.

Reduce the comparison time

Most of the time taken by the naive algorithm is spent performing comparisons between items in the sequences. For textual sequences such as source code, lines should be viewed as the sequence elements instead of single characters. This can mean comparisons of relatively long strings for each step in the algorithm. Two optimizations can be made that can help to reduce the time these comparisons consume.

Reduce strings to hashes

A hash function or checksum can be used to reduce the size of the strings in the sequences. That is, for source code where the average line is 60 or more characters long, the hash or checksum for that line might be only 8 to 40 characters long. Additionally, the randomized nature of hashes and checksums makes it likely that comparisons will short-circuit sooner, since lines of source code will rarely be changed at the beginning.
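
A sketch of this idea in Python, reducing each line to a short fixed-size digest before running the LCS machinery (truncating the digest trades some collision resistance for space; the sample data and names are illustrative):

    import hashlib

    def line_digests(lines):
        """Replace each line by an 8-byte digest so element comparisons are cheap and fixed-size."""
        return [hashlib.sha1(line.encode("utf-8")).digest()[:8] for line in lines]

    old = ["def f(x):", "    return x + 1", ""]
    new = ["def f(x):", "    return x + 2", ""]

    # Run the LCS/diff algorithm on the digest sequences; matched indices map
    # one-to-one back to the original lines for display.
    a, b = line_digests(old), line_digests(new)
    print(sum(u == v for u, v in zip(a, b)))  # 2: two of the three aligned lines hash identically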

There are three primary drawbacks to this optimization. First, an amount of time needs to be spent beforehand to precompute the hashes for the two sequences. Second, additional memory needs to be allocated for the new hashed sequences. However, in comparison to the naive algorithm used here, both of these drawbacks are relatively minimal.

The third drawback is that of collisions. Since the checksum or hash is not guaranteed to be unique, there is a small chance that two different items could be reduced to the same hash. This is unlikely in source code, but it is possible. A cryptographic hash would therefore be far better suited for this optimization, as its entropy is going to be significantly greater than that of a simple checksum. However, the benefits may not be worth the setup and computational requirements of a cryptographic hash for small sequence lengths.

Reduce the required space

If only the length of the LCS is required, the matrix can be reduced to a 2 × min(n, m) matrix, or to a min(n, m) + 1 vector, as the dynamic programming approach requires only the current and previous columns of the matrix. Hirschberg's algorithm allows the construction of the optimal sequence itself in the same quadratic time and linear space bounds. [8]
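
A length-only variant in Python that keeps just two rows of the table (a sketch; recovering the sequence itself in linear space requires Hirschberg's divide-and-conquer):

    def lcs_length_two_rows(x, y):
        """LCS length in O(min(m, n)) extra space: only the current and previous rows are kept."""
        if len(y) > len(x):
            x, y = y, x                   # keep the rows as short as possible
        prev = [0] * (len(y) + 1)
        for xi in x:
            curr = [0] * (len(y) + 1)
            for j, yj in enumerate(y, start=1):
                curr[j] = prev[j-1] + 1 if xi == yj else max(curr[j-1], prev[j])
            prev = curr
        return prev[-1]

    print(lcs_length_two_rows("XMJYAUZ", "MZJAWXU"))  # 4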

Reduce cache misses

Chowdhury and Ramachandran devised a quadratic-time linear-space algorithm [9] [10] for finding the LCS length along with an optimal sequence which runs faster than Hirschberg's algorithm in practice due to its superior cache performance. [9] The algorithm has an asymptotically optimal cache complexity under the Ideal cache model. [11] Interestingly, the algorithm itself is cache-oblivious [11] meaning that it does not make any choices based on the cache parameters (e.g., cache size and cache line size) of the machine.

Further optimized algorithms

Several algorithms exist that run faster than the presented dynamic programming approach. One of them is the Hunt–Szymanski algorithm, which typically runs in O((r + n) log n) time (for r on the order of n), where r is the number of matching pairs of positions between the two sequences. [12] For problems with a bounded alphabet size, the Method of Four Russians can be used to reduce the running time of the dynamic programming algorithm by a logarithmic factor. [13]

Behavior on random strings

Beginning with Chvátal & Sankoff (1975), [14] a number of researchers have investigated the behavior of the longest common subsequence length when the two given strings are drawn randomly from the same alphabet. When the alphabet size is constant, the expected length of the LCS is proportional to the length of the two strings, and the constants of proportionality (depending on alphabet size) are known as the Chvátal–Sankoff constants. Their exact values are not known, but upper and lower bounds on their values have been proven, [15] and it is known that they grow inversely proportionally to the square root of the alphabet size. [16] Simplified mathematical models of the longest common subsequence problem have been shown to be controlled by the Tracy–Widom distribution. [17]
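
The expected ratio can also be estimated empirically. A crude Monte Carlo sketch in Python (for the binary alphabet, the proven bounds place the constant roughly between 0.78 and 0.83 [15]):

    import random

    def lcs_length(x, y):
        """Two-row LCS length (see the space-reduction section above)."""
        prev = [0] * (len(y) + 1)
        for xi in x:
            curr = [0] * (len(y) + 1)
            for j, yj in enumerate(y, start=1):
                curr[j] = prev[j-1] + 1 if xi == yj else max(curr[j-1], prev[j])
            prev = curr
        return prev[-1]

    def estimate_constant(sigma, n, trials=10):
        """Monte Carlo estimate of E[LCS length] / n for two random length-n strings
        over an alphabet of size sigma; approaches the Chvatal-Sankoff constant as n grows."""
        total = sum(lcs_length(random.choices(range(sigma), k=n),
                               random.choices(range(sigma), k=n))
                    for _ in range(trials))
        return total / (trials * n)

    print(estimate_constant(2, 2000))  # typically near 0.8 for a binary alphabet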


References

  1. Maier, David (1978). "The Complexity of Some Problems on Subsequences and Supersequences". Journal of the ACM. 25 (2): 322–336. doi:10.1145/322063.322075. S2CID 16120634.
  2. Wagner, Robert; Fischer, Michael (January 1974). "The string-to-string correction problem". Journal of the ACM. 21 (1): 168–173. CiteSeerX 10.1.1.367.5281. doi:10.1145/321796.321811. S2CID 13381535.
  3. Bergroth, L.; Hakonen, H.; Raita, T. (September 2000). "A survey of longest common subsequence algorithms". Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000). A Coruña, Spain: IEEE Computer Society. pp. 39–48. doi:10.1109/SPIRE.2000.878178. ISBN 0-7695-0746-8. S2CID 10375334.
  4. Greenberg, Ronald I. (2003-08-06). "Bounds on the Number of Longest Common Subsequences". arXiv:cs.DM/0301030.
  5. Xia, Xuhua (2007). Bioinformatics and the Cell: Modern Computational Approaches in Genomics, Proteomics and Transcriptomics. New York: Springer. p. 24. ISBN 978-0-387-71336-6.
  6. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "15.4". Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 350–355. ISBN 0-262-53196-8.
  7. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. "Dynamic Programming". Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. p. 394. ISBN 0-262-03384-4.
  8. Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM. 18 (6): 341–343. doi:10.1145/360825.360861. S2CID 207694727.
  9. Chowdhury, Rezaul; Ramachandran, Vijaya (January 2006). "Cache-oblivious dynamic programming". Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '06). pp. 591–600. doi:10.1145/1109557.1109622. ISBN 0898716055. S2CID 9650418.
  10. Chowdhury, Rezaul; Le, Hai-Son; Ramachandran, Vijaya (July 2010). "Cache-oblivious dynamic programming for bioinformatics". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 7 (3): 495–510. doi:10.1109/TCBB.2008.94. PMID 20671320. S2CID 2532039.
  11. Frigo, Matteo; Leiserson, Charles E.; Prokop, Harald; Ramachandran, Sridhar (January 2012). "Cache-oblivious algorithms". ACM Transactions on Algorithms. 8 (1): 1–22. doi:10.1145/2071379.2071383.
  12. Apostolico, Alberto; Galil, Zvi (1997-05-29). Pattern Matching Algorithms. Oxford University Press. ISBN 9780195354348.
  13. Masek, William J.; Paterson, Michael S. (1980). "A faster algorithm computing string edit distances". Journal of Computer and System Sciences. 20 (1): 18–31. doi:10.1016/0022-0000(80)90002-1. MR 0566639.
  14. Chvátal, Václav; Sankoff, David (1975). "Longest common subsequences of two random sequences". Journal of Applied Probability. 12 (2): 306–315. doi:10.2307/3212444. JSTOR 3212444. MR 0405531. S2CID 250345191.
  15. Lueker, George S. (2009). "Improved bounds on the average length of longest common subsequences". Journal of the ACM. 56 (3): A17. doi:10.1145/1516512.1516519. MR 2536132. S2CID 7232681.
  16. Kiwi, Marcos; Loebl, Martin; Matoušek, Jiří (2005). "Expected length of the longest common subsequence for large alphabets". Advances in Mathematics. 197 (2): 480–498. arXiv:math/0308234. doi:10.1016/j.aim.2004.10.012. MR 2173842.
  17. Majumdar, Satya N.; Nechaev, Sergei (2005). "Exact asymptotic results for the Bernoulli matching model of sequence alignment". Physical Review E. 72 (2): 020901. arXiv:q-bio/0410012. Bibcode:2005PhRvE..72b0901M. doi:10.1103/PhysRevE.72.020901. MR 2177365. PMID 16196539. S2CID 11390762.