Edit distance

Last updated

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

Contents

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance. [1]

Types of edit distance

Different types of edit distance allow different sets of string operations. For instance:

Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

Formal definition and properties

Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966: [2]

Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
Deletion of a single symbol changes uxv to uv (x→ε).
Substitution of a single symbol x for a symbol yx changes uxv to uyv (xy).

In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a to b. A more general definition associates non-negative weight functions wins(x), wdel(x) and wsub(x, y) with the operations. [2]

Additional primitive operations have been suggested. Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes uxyv into uyxv. [3] [4] For the task of correcting OCR output, merge and split operations have been used which replace a single character into a pair of them or vice versa. [4]

Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost. [1] :37 Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings. [1] Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed.

Example

The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:

  1. kitten → sitten (substitute "s" for "k")
  2. sitten → sittin (substitute "i" for "e")
  3. sittin → sitting (insert "g" at the end)

LCS distance (insertions and deletions only) gives a different distance and minimal edit script:

  1. kitten → itten (delete "k" at 0)
  2. itten → sitten (insert "s" at 0)
  3. sitten → sittn (delete "e" at 4)
  4. sittn → sittin (insert "i" at 4)
  5. sittin → sitting (insert "g" at 6)

for a total cost/distance of 5 operations.

Properties

Edit distance with non-negative cost satisfies the axioms of a metric, giving rise to a metric space of strings, when the following conditions are met: [1] :37

With these properties, the metric axioms are satisfied as follows:

d(a, b) = 0 if and only if a=b, since each string can be trivially transformed to itself using exactly zero operations.
d(a, b) > 0 when ab, since this would require at least one operation at non-zero cost.
d(a, b) = d(b, a) by equality of the cost of each operation and its inverse.
Triangle inequality: d(a, c) ≤ d(a, b) + d(b, c). [5]

Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature. [1]

Other useful properties of unit-cost edit distances include:

Regardless of cost/weights, the following property holds of all edit distances:

Computation

The first algorithm for computing minimum edit distance between a pair of strings was published by Damerau in 1964. [6]

Common algorithm

Using Levenshtein's original operations, the (nonsymmetric) edit distance from to is given by , defined by the recurrence [2]

This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization. [3]

The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer, [7] although it has a history of multiple invention. [2] [3] After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm starting at .

This algorithm has a time complexity of Θ(mn) where m and n are the lengths of the strings. When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m,n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations. [3] A linear-space solution to this problem is offered by Hirschberg's algorithm. [8] :634 A general recursive divide-and-conquer framework for solving such recurrences and extracting an optimal sequence of operations cache-efficiently in space linear in the size of the input is given by Chowdhury, Le, and Ramachandran. [9]

Improved algorithms

Improving on the Wagner–Fisher algorithm described above, Ukkonen describes several variants, [10] one of which takes two strings and a maximum edit distance s, and returns min(s, d). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O(s×min(m,n)), where m and n are the lengths of the strings. Space complexity is O(s2) or O(s), depending on whether the edit sequence needs to be read off. [3]

Further improvements by Landau, Myers, and Schmidt give an O(s2 + max(m,n)) time algorithm. [11]

For a finite alphabet and edit costs which are multiples of each other, the fastest known exact algorithm is of Masek and Paterson [12] having worst case runtime of O(nm/logn).

Applications

Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected.

Various algorithms exist that solve problems beside the computation of distance between a pair of strings, to solve related types of problems.

Language edit distance

A generalization of the edit distance between strings is the language edit distance between a string and a language, usually a formal language. Instead of considering the edit distance between one string and another, the language edit distance is the minimum edit distance that can be attained between a fixed string and any string taken from a set of strings. More formally, for any language L and string x over an alphabet Σ, the language edit distance d(L, x) is given by [14] , where is the string edit distance. When the language L is context free, there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance. [15] For less expressive families of grammars, such as the regular grammars, faster algorithms exist for computing the edit distance. [16]

Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem. [14] [17]

See also

Related Research Articles

<span class="mw-page-title-main">Kolmogorov complexity</span> Measure of algorithmic complexity

In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity, Solomonoff–Kolmogorov–Chaitin complexity, program-size complexity, descriptive complexity, or algorithmic entropy. It is named after Andrey Kolmogorov, who first published on the subject in 1963 and is a generalization of classical information theory.

<span class="mw-page-title-main">Travelling salesman problem</span> NP-hard problem in combinatorial optimization

The travelling salesman problem (TSP) asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" It is an NP-hard problem in combinatorial optimization, important in theoretical computer science and operations research.

<span class="mw-page-title-main">Hamming distance</span> Number of bits that differ between two strings

In information theory, the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently, the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences. It is named after the American mathematician Richard Hamming.

<span class="mw-page-title-main">Longest common subsequence</span> Algorithmic problem on pairs of sequences

A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

<span class="mw-page-title-main">Needleman–Wunsch algorithm</span> Method for aligning biological sequences

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

In computer science, the string-to-string correction problem refers to determining the minimum cost sequence of edit operations necessary to change one string into another. Each type of edit operation has its own cost value. A single edit operation may be changing a single symbol of the string into another, deleting a symbol, or inserting a new symbol.

The bitap algorithm is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance – if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y.

In computer science, a Levenshtein automaton for a string w and a number n is a finite-state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n. That is, a string x is in the formal language recognized by the Levenshtein automaton if and only if x can be transformed into w by at most n single-character insertions, deletions, and substitutions.

In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations required to change one word into the other.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system.

In computer science and statistics, the Jaro–Winkler similarity is a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric metric proposed in 1990 by William E. Winkler.

In mathematics and computer science, a string metric is a metric that measures distance between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number indicating an algorithm-specific indication of distance.

In computer science, the earth mover's distance (EMD) is a measure of dissimilarity between two frequency distributions, densities, or measures, over a metric space D. Informally, if the distributions are interpreted as two different ways of piling up earth (dirt) over D, the EMD captures the minimum cost of building the smaller pile using dirt taken from the larger, where cost is defined as the amount of dirt moved multiplied by the distance over which it is moved.

In computer science, the Wagner–Fischer algorithm is a dynamic programming algorithm that computes the edit distance between two strings of characters.

In mathematics and computer science, graph edit distance (GED) is a measure of similarity between two graphs. The concept of graph edit distance was first formalized mathematically by Alberto Sanfeliu and King-Sun Fu in 1983. A major application of graph edit distance is in inexact graph matching, such as error-tolerant pattern recognition in machine learning.

References

  1. 1 2 3 4 5 6 7 Navarro, Gonzalo (1 March 2001). "A guided tour to approximate string matching" (PDF). ACM Computing Surveys. 33 (1): 31–88. CiteSeerX   10.1.1.452.6317 . doi:10.1145/375360.375365. S2CID   207551224 . Retrieved 19 March 2015.
  2. 1 2 3 4 Daniel Jurafsky; James H. Martin. Speech and Language Processing. Pearson Education International. pp. 107–111.
  3. 1 2 3 4 5 Esko Ukkonen (1983). On approximate string matching. Foundations of Computation Theory. Springer. pp. 487–495. doi:10.1007/3-540-12689-9_129.
  4. 1 2 3 4 Schulz, Klaus U.; Mihov, Stoyan (2002). "Fast string correction with Levenshtein automata". International Journal of Document Analysis and Recognition. 5 (1): 67–85. CiteSeerX   10.1.1.16.652 . doi:10.1007/s10032-002-0082-8. S2CID   207046453.
  5. Lei Chen; Raymond Ng (2004). On the marriage of Lp-norms and edit distance (PDF). Proc. 30th Int'l Conf. on Very Large Databases (VLDB). Vol. 30. doi:10.1016/b978-012088469-8.50070-x.
  6. Kukich, Karen (1992). "Techniques for Automatically Correcting Words in Text" (PDF). ACM Computing Surveys. 24 (4): 377–439. doi:10.1145/146370.146380. S2CID   5431215. Archived from the original (PDF) on 2016-09-27. Retrieved 2017-11-09.
  7. R. Wagner; M. Fischer (1974). "The string-to-string correction problem". J. ACM. 21: 168–178. doi: 10.1145/321796.321811 . S2CID   13381535.
  8. Skiena, Steven (2010). The Algorithm Design Manual (2nd ed.). Springer Science+Business Media. Bibcode:2008adm..book.....S. ISBN   978-1-849-96720-4.
  9. Chowdhury, Rezaul; Le, Hai-Son; Ramachandran, Vijaya (July 2010). "Cache-oblivious dynamic programming for bioinformatics". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 7 (3): 495–510. doi:10.1109/TCBB.2008.94. PMID   20671320. S2CID   2532039.
  10. Ukkonen, Esko (1985). "Algorithms for approximate string matching" (PDF). Information and Control. 64 (1–3): 100–118. doi: 10.1016/S0019-9958(85)80046-2 .
  11. Landau; Myers; Schmidt (1998). "Incremental String Comparison". SIAM Journal on Computing. 27 (2): 557–582. CiteSeerX   10.1.1.38.1766 . doi:10.1137/S0097539794264810.
  12. Masek, William J.; Paterson, Michael S. (February 1980). "A faster algorithm computing string edit distances". Journal of Computer and System Sciences. 20 (1): 18–31. doi: 10.1016/0022-0000(80)90002-1 . ISSN   0022-0000.
  13. Esko Ukkonen (1985). "Finding approximate patterns in strings". J. Algorithms. 6: 132–137. doi:10.1016/0196-6774(85)90023-9.
  14. 1 2 Bringmann, Karl; Grandoni, Fabrizio; Saha, Barna; Williams, Virginia Vassilevska (2016). "Truly Sub-cubic Algorithms for Language Edit Distance and RNA-Folding via Fast Bounded-Difference Min-Plus Product" (PDF). 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). pp. 375–384. arXiv: 1707.05095 . doi:10.1109/focs.2016.48. ISBN   978-1-5090-3933-3. S2CID   17064578.
  15. Aho, A.; Peterson, T. (1972-12-01). "A Minimum Distance Error-Correcting Parser for Context-Free Languages". SIAM Journal on Computing. 1 (4): 305–312. doi:10.1137/0201022. ISSN   0097-5397.
  16. Wagner, Robert A. (1974). "Order-n correction for regular languages". Communications of the ACM. 17 (5): 265–268. doi: 10.1145/360980.360995 . S2CID   11063282.
  17. Saha, B. (2014-10-01). The Dyck Language Edit Distance Problem in Near-Linear Time. 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. pp. 611–620. doi:10.1109/FOCS.2014.71. ISBN   978-1-4799-6517-5. S2CID   14806359.