String-to-string correction problem

Last updated

In computer science, the string-to-string correction problem refers to determining the minimum cost sequence of edit operations necessary to change one string into another (i.e., computing the shortest edit distance). Each type of edit operation has its own cost value. [1] A single edit operation may be changing a single symbol of the string into another (cost WC), deleting a symbol (cost WD), or inserting a new symbol (cost WI). [2]

If all edit operations have the same unit costs (WC = WD = WI = 1) the problem is the same as computing the Levenshtein distance of two strings.

Several algorithms exist to provide an efficient way to determine string distance and specify the minimum number of transformation operations required. [3] [4] Such algorithms are particularly useful for delta creation operations where something is stored as a set of differences relative to a base version. This allows several versions of a single object to be stored much more efficiently than storing them separately. This holds true even for single versions of several objects if they do not differ greatly, or anything in between. Notably, such difference algorithms are used in molecular biology to provide some measure of kinship between different kinds of organisms based on the similarities of their macromolecules (such as proteins or DNA).

Extension

The extended variant of the problem includes a new type of edit operation: swapping any two adjacent symbols, with a cost of WS.

This version can be solved in polynomial time under certain restrictions on edit operation costs. [2] [5]

Robert A. Wagner (1975) showed that the general problem is NP-complete. In particular, he proved that when WI < WC = WD = ∞ and 0 < WS < ∞ (or equivalently, changing and deletion are not permitted), the problem is NP-complete. [5]

Related Research Articles

<span class="mw-page-title-main">NP (complexity)</span> Complexity class used to classify decision problems

In computational complexity theory, NP is a complexity class used to classify decision problems. NP is the set of decision problems for which the problem instances, where the answer is "yes", have proofs verifiable in polynomial time by a deterministic Turing machine, or alternatively the set of problems that can be solved in polynomial time by a nondeterministic Turing machine.

<span class="mw-page-title-main">Minimum spanning tree</span> Least-weight tree connecting graph vertices

A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph has a minimum spanning forest, which is a union of the minimum spanning trees for its connected components.

<span class="mw-page-title-main">Shortest path problem</span> Computational problem of graph theory

In graph theory, the shortest path problem is the problem of finding a path between two vertices in a graph such that the sum of the weights of its constituent edges is minimized.

<span class="mw-page-title-main">Dijkstra's algorithm</span> Graph search algorithm

Dijkstra's algorithm is an algorithm for finding the shortest paths between nodes in a weighted graph, which may represent, for example, road networks. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years later.

<span class="mw-page-title-main">Longest common subsequence</span> Algorithmic problem on pairs of sequences

A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In compiler optimization, register allocation is the process of assigning local automatic variables and expression results to a limited number of processor registers.

<span class="mw-page-title-main">Steiner tree problem</span> On short connecting networks with added vertices

In combinatorial mathematics, the Steiner tree problem, or minimum Steiner tree problem, named after Jakob Steiner, is an umbrella term for a class of problems in combinatorial optimization. While Steiner tree problems may be formulated in a number of settings, they all require an optimal interconnect for a given set of objects and a predefined objective function. One well-known variant, which is often used synonymously with the term Steiner tree problem, is the Steiner tree problem in graphs. Given an undirected graph with non-negative edge weights and a subset of vertices, usually referred to as terminals, the Steiner tree problem in graphs requires a tree of minimum weight that contains all terminals and minimizes the total weight of its edges. Further well-known variants are the Euclidean Steiner tree problem and the rectilinear minimum Steiner tree problem.

<i>k</i>-minimum spanning tree

The k-minimum spanning tree problem, studied in theoretical computer science, asks for a tree of minimum cost that has exactly k vertices and forms a subgraph of a larger graph. It is also called the k-MST or edge-weighted k-cardinality tree. Finding this tree is NP-hard, but it can be approximated to within a constant approximation ratio in polynomial time.

The graph isomorphism problem is the computational problem of determining whether two finite graphs are isomorphic.

In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations required to change one word into the other.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

In computational complexity, problems that are in the complexity class NP but are neither in the class P nor NP-complete are called NP-intermediate, and the class of such problems is called NPI. Ladner's theorem, shown in 1975 by Richard E. Ladner, is a result asserting that, if P ≠ NP, then NPI is not empty; that is, NP contains problems that are neither in P nor NP-complete. Since it is also true that if NPI problems exist, then P ≠ NP, it follows that P = NP if and only if NPI is empty.

<span class="mw-page-title-main">Dan Hirschberg</span> American computer scientist

Daniel S. Hirschberg is a full professor in Computer Science at University of California, Irvine. His research interests are in the theory of design and analysis of algorithms.

In mathematical logic, monadic second-order logic (MSO) is the fragment of second-order logic where the second-order quantification is limited to quantification over sets. It is particularly important in the logic of graphs, because of Courcelle's theorem, which provides algorithms for evaluating monadic second-order formulas over graphs of bounded treewidth. It is also of fundamental importance in automata theory, where the Büchi–Elgot–Trakhtenbrot theorem gives a logical characterization of the regular languages.

In computer science, lattice problems are a class of optimization problems related to mathematical objects called lattices. The conjectured intractability of such problems is central to the construction of secure lattice-based cryptosystems: Lattice problems are an example of NP-hard problems which have been shown to be average-case hard, providing a test case for the security of cryptographic algorithms. In addition, some lattice problems which are worst-case hard can be used as a basis for extremely secure cryptographic schemes. The use of worst-case hardness in such schemes makes them among the very few schemes that are very likely secure even against quantum computers. For applications in such cryptosystems, lattices over vector space or free modules are generally considered.

Michael John Fischer is an American computer scientist who works in the fields of distributed computing, parallel computing, cryptography, algorithms and data structures, and computational complexity.

In computer science, the Wagner–Fischer algorithm is a dynamic programming algorithm that computes the edit distance between two strings of characters.

References

  1. Wagner, Robert A.; Fischer, Michael J. (1974). "The String-to-String Correction Problem". Journal of the ACM. 21 (1): 168–173. doi: 10.1145/321796.321811 . S2CID   13381535.
  2. 1 2 Lowrance, Roy; Wagner, Robert A. (April 1975). "An Extension of the String-to-String Correction Problem". Journal of the ACM. 22 (2): 177–183. doi: 10.1145/321879.321880 . S2CID   18892193.
  3. Edit distance#Computation
  4. Tichy, Walter F. (1984). "The string-to-string correction problem with block moves". ACM Transactions on Computer Systems. 2 (4): 309–321. doi: 10.1145/357401.357404 . S2CID   14034845.
  5. 1 2 Wagner, Robert A. (May 1975). "On the complexity of the Extended String-to-String Correction Problem". Proceedings of seventh annual ACM symposium on Theory of computing - STOC '75. pp. 218–223. doi:10.1145/800116.803771. ISBN   9781450374194. S2CID   18705107.