Closest string

Last updated

In theoretical computer science, the closest string is an NP-hard computational problem, [1] which tries to find the geometrical center of a set of input strings.

Contents

To understand the word "center", it is necessary to define a distance between two strings. Usually, this problem is studied with the Hamming distance in mind.

Formal definition

More formally, given n strings s1, s2, ..., sn of length m, the closest string problem seeks a new string s of length m such that d(s,si) ≤ k for all i, where d is the Hamming distance, and where k is as small as possible. [2] A decision problem version of the closest string problem, which is NP-complete, instead takes k as another input and questions whether there is a string within Hamming distance k of all the input strings. [1]

The closest string problem can be seen as a special case of the generic 1-center problem in which the distances between elements are measured using Hamming distance.

Motivation

In bioinformatics, the closest string problem is an intensively studied facet of the problem of finding signals in DNA.

Simplifications and data reductions

Instances of closest string may contain information that is not essential to the problem. In some sense, the usual input of closest string contains information, that does not contribute to the hardness of the problem. For example, if some strings contain the character a, but none contains the character z, replacing all as with zs would yield an essentially equivalent instance, that is: from a solution of the modified instance, the original solution can be restored, and vice versa.

Normalizing the input

When all input strings that share the same length are written on top of each other, they form a matrix. Certain row types have essentially the same implications to the solution. For example, replacing a column with entries (a, a, b) with another column (x, x, y) might lead to a different solution string, but cannot affect solvability, because both columns express the same structure, viz. the first two entries being equal, but different from the third one.

An input instance can be normalized by replacing, in each column, the character that occurs the most often with a, the character that occurs the second most often with b, and so forth. Given a solution to the normalized instance, the original instance can be found by remapping the characters of the solution to its original version in every column.

The order of the columns does not contribute to the hardness of the problem. That means, if we permute all input strings according to a certain permutation π and obtain a solution string s to that modified instance, then π−1(s) will be a solution to the original instance.

Example

Search space for the normalized problem. The center string aaaa and aaab leads to Hamming distances 1,2,1 and 2,1,1, respectively. Closest-string problem example svg.svg
Search space for the normalized problem. The center string aaaa and aaab leads to Hamming distances 1,2,1 and 2,1,1, respectively.

Given an instance with three input strings uvwx, xuwv, and xvwu. This could be written as a matrix like this:

uvwx
xuwv
xvwu

The first column has the values (u, x, x). As x is the character that appears the most often, we replace it by a, and we replace u, the second most often character, by b, obtaining the new first column (b, a, a). The second column has the values (v, u, v). As for the first column, v is replaced by a and u is replaced by b, obtaining the new second column (a, b, a). Doing the same with all columns gives the normalized instance

baaa
abab
aaac

Data reduction obtained from normalization

Normalizing the input reduces the alphabet size to at most the number of input strings. This can be useful for algorithms whose running times depend on the alphabet size.

Approximability

Li et al. evolved a polynomial-time approximation scheme [3] which is practically unusable because of the large hidden constants.

Fixed-parameter tractability

Closest String can be solved in , where k is the number of input strings, L is the length of all strings and d is the desired maximum distance from the solution string to any input string. [4]

Relations to other problems

Closest string is a special case of the more general closest substring problem, which is strictly more difficult. While closest string turns out to be fixed-parameter tractable in a number of ways, closest substring is W[1]-hard with regard to these parameters.

Related Research Articles

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

<span class="mw-page-title-main">NP (complexity)</span> Complexity class used to classify decision problems

In computational complexity theory, NP is a complexity class used to classify decision problems. NP is the set of decision problems for which the problem instances, where the answer is "yes", have proofs verifiable in polynomial time by a deterministic Turing machine, or alternatively the set of problems that can be solved in polynomial time by a nondeterministic Turing machine.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

<span class="mw-page-title-main">Hamming distance</span> Number of bits that differ between two strings

In information theory, the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently, the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences. It is named after the American mathematician Richard Hamming.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In computer science, parameterized complexity is a branch of computational complexity theory that focuses on classifying computational problems according to their inherent difficulty with respect to multiple parameters of the input or output. The complexity of a problem is then measured as a function of those parameters. This allows the classification of NP-hard problems on a finer scale than in the classical setting, where the complexity of a problem is only measured as a function of the number of bits in the input. This appears to have been first demonstrated in Gurevich, Stockmeyer & Vishkin (1984). The first systematic work on parameterized complexity was done by Downey & Fellows (1999).

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y.

MAX-3SAT is a problem in the computational complexity subfield of computer science. It generalises the Boolean satisfiability problem (SAT) which is a decision problem considered in complexity theory. It is defined as:

In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations required to change one word into the other.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

String functions are used in computer programming languages to manipulate a string or query information about a string.

In theoretical computer science, a computational problem is a problem that may be solved by an algorithm. For example, the problem of factoring

In machine learning and data mining, a string kernel is a kernel function that operates on strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intuitively understood as functions measuring the similarity of pairs of strings: the more similar two strings a and b are, the higher the value of a string kernel K(a, b) will be.

In computer science, the Wagner–Fischer algorithm is a dynamic programming algorithm that computes the edit distance between two strings of characters.

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings rather than returning only one substring or returning the maximum length of a palindromic substring.

In the field of computational biology, a planted motif search (PMS) also known as a (l, d)-motif search (LDMS) is a method for identifying conserved motifs within a set of nucleic acid or peptide sequences.

In mathematics, the Chvátal–Sankoff constants are mathematical constants that describe the lengths of longest common subsequences of random strings. Although the existence of these constants has been proven, their exact values are unknown. They are named after Václav Chvátal and David Sankoff, who began investigating them in the mid-1970s.

References

  1. 1 2 Lanctot, J. Kevin; Li, Ming; Ma, Bin; Wang, Shaojiu; Zhang, Louxin (2003), "Distinguishing string selection problems", Information and Computation , 185 (1): 41–55, doi:10.1016/S0890-5401(03)00057-9, MR   1994748
  2. Bin Ma; Xiaming Sun (2008). "More Efficient Algorithms for Closest String and Substring Problems" (PDF). Research in Computational Molecular Biology. 12th Ann. Int. Conf. on Research in Computational Molecular Biology (RECOMB). LNCS. Vol. 4955. Springer. pp. 396–409. doi:10.1007/978-3-540-78839-3_33. ISBN   978-3-540-78838-6.
  3. M. Li; B. Ma; L. Wang. (2002), "On the Closest String and Substring Problems." (PDF), Journal of the ACM , 49 (2): 157–171, arXiv: cs/0002012 , doi:10.1145/506147.506150, S2CID   965332
  4. Jens Gramm; Rolf Niedermeier; Peter Rossmanith (2003), "Fixed-Parameter Algorithms for Closest String and Related Problems", Algorithmica , 37: 25–42, CiteSeerX   10.1.1.61.736 , doi:10.1007/s00453-003-1028-3, S2CID   8206021