Shortest common supersequence

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y.

A shortest common supersequence (SCS) is a common supersequence of minimal length. In the shortest common supersequence problem, two sequences X and Y are given, and the task is to find a shortest possible common supersequence of these sequences. In general, an SCS is not unique.

For two input sequences, an SCS can be formed from a longest common subsequence (LCS) easily. For example, the longest common subsequence of X = abcbdab and Y = bdcaba is Z = bcba. By inserting the non-LCS symbols into Z while preserving their original order, we obtain a shortest common supersequence U = abdcabdab. In particular, the equation |Z| + |U| = m + n holds for any two input sequences.
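This construction is easy to implement. The following Python sketch (illustrative, not part of the original article) computes an LCS table by dynamic programming and then walks back through it, inserting the non-LCS symbols as described above:

```python
def shortest_common_supersequence(x: str, y: str) -> str:
    """Build one SCS of x and y from the standard LCS dynamic program."""
    m, n = len(x), len(y)
    # dp[i][j] = length of an LCS of x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back through the table: matched (LCS) symbols are emitted once,
    # all other symbols of x and y are inserted in their original order.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            out.append(x[i - 1]); i -= 1
        else:
            out.append(y[j - 1]); j -= 1
    out.extend(reversed(x[:i] + y[:j]))  # leftover prefix (one is empty)
    return "".join(reversed(out))

u = shortest_common_supersequence("abcbdab", "bdcaba")
assert len(u) == 9  # 7 + 6 - 4, i.e. |U| = m + n - |Z|
```

Ties in the table mean the reconstructed supersequence may differ from the U given above, but it always has the same length.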

There is no similar relationship between shortest common supersequences and longest common subsequences of three or more input sequences. (In particular, LCS and SCS are not dual problems.) However, both problems can be solved in O(n^k) time using dynamic programming, where k is the number of sequences and n is their maximum length. For the general case of an arbitrary number of input sequences, the problem is NP-hard. [1]
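For two sequences, the table of SCS lengths can also be filled directly, without going through the LCS. A minimal sketch of this dynamic program (again illustrative, not from the article):

```python
def scs_length(x: str, y: str) -> int:
    """Length of a shortest common supersequence of x and y."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j]: SCS of x[:i], y[:j]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j                 # one prefix is empty
            elif x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # shared symbol, used once
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

The same idea extends to k sequences with a k-dimensional table, which gives the O(n^k) algorithm mentioned above.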

Shortest common superstring

The closely related problem of finding a minimum-length string which is a superstring of a finite set of strings S = { s1,s2,...,sn } is also NP-hard. [2] Several constant factor approximations have been proposed over the years, and the current best known algorithm has an approximation factor of 2.475. [3] However, perhaps the simplest solution is to reformulate the problem as an instance of weighted set cover in such a way that the weight of the optimal solution to the set cover instance is less than twice the length of the shortest superstring of S. One can then use the O(log(n))-approximation for weighted set cover to obtain an O(log(n))-approximation for the shortest superstring (note that this is not a constant factor approximation).

For any string x over the alphabet of the strings in S, define P(x) to be the set of all strings which are substrings of x. The instance I of set cover is formulated as follows:

- Let M be empty.
- For each pair of strings si and sj in S, if the last k symbols of si are the same as the first k symbols of sj for some k > 0, add to M the string consisting of the concatenation of si and sj with maximal overlap.
- Define the universe of the set cover instance to be S.
- Define the set of subsets of the universe to be { P(x) | x ∈ S ∪ M }.
- Define the cost of each subset P(x) to be |x|, the length of x.

The instance I can then be solved using an algorithm for weighted set cover, and the algorithm can output an arbitrary concatenation of the strings x for which the weighted set cover algorithm outputs P(x). [4]
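The following Python sketch (names such as approx_shortest_superstring are illustrative, not from the source) builds the instance I and solves it with the standard greedy heuristic for weighted set cover, which repeatedly picks the candidate with the smallest cost per newly covered element:

```python
from itertools import permutations

def max_overlap_merge(a: str, b: str):
    """Concatenation of a and b with maximal nonzero suffix/prefix overlap."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None  # no overlap: no merged string is added to M

def approx_shortest_superstring(S: set) -> str:
    # M: maximal-overlap concatenations of each ordered pair from S.
    M = set()
    for a, b in permutations(S, 2):
        merged = max_overlap_merge(a, b)
        if merged is not None:
            M.add(merged)
    # For each x in S ∪ M, the part of P(x) that matters for covering the
    # universe S: the strings of S occurring as substrings of x. Cost is |x|.
    covers = {x: {s for s in S if s in x} for x in S | M}
    # Greedy weighted set cover.
    uncovered, chosen = set(S), []
    while uncovered:
        x = min((x for x in covers if covers[x] & uncovered),
                key=lambda x: len(x) / len(covers[x] & uncovered))
        uncovered -= covers[x]
        chosen.append(x)
    # Any concatenation of the chosen strings is a superstring of S.
    return "".join(chosen)
```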

Example

Consider the set S = { abc, cde, fab }, which becomes the universe of the weighted set cover instance. In this case, M = { abcde, fabc }. Then the set of subsets of the universe is

{ P(x) | x ∈ S ∪ M } = { P(abc), P(cde), P(fab), P(abcde), P(fabc) },

which have costs 3, 3, 3, 5, and 4, respectively.
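Running the sketch above on this instance reproduces these sets and costs. With the greedy rule, P(fabc) is picked first (cost 4 covering abc and fab, the best ratio), then P(cde):

```python
S = {"abc", "cde", "fab"}
print(approx_shortest_superstring(S))  # "fabccde", length 7
# The optimal superstring here is "fabcde" (length 6), so the greedy
# result is a feasible superstring but not necessarily optimal.
```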

References

  1. David Maier (1978). "The Complexity of Some Problems on Subsequences and Supersequences". J. ACM. ACM Press. 25 (2): 322–336. doi:10.1145/322063.322075. S2CID 16120634.
  2. Kari-Jouko Räihä, Esko Ukkonen (1981). "The shortest common supersequence problem over binary alphabet is NP-complete". Theoretical Computer Science. 16 (2): 187–198. doi:10.1016/0304-3975(81)90075-x.
  3. Matthias Englert, Nicolaos Matsakis and Pavel Veselý (2022). "Improved approximation guarantees for shortest superstrings using cycle classification by overlap to length ratios". Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing. pp. 317–330. doi:10.1145/3519935.3520001. ISBN 9781450392648. S2CID 243847650.
  4. Vazirani, p. 20.