Subsequence

Last updated

In mathematics, a subsequence is a sequence that can be derived from another sequence by deleting some or no elements without changing the order of the remaining elements. For example, the sequence is a subsequence of obtained after removal of elements , , and . The relation of one sequence being the subsequence of another is a preorder.

Contents

Subsequences can contain consecutive elements which were not consecutive in the original sequence. A subsequence which consists of a consecutive run of elements from the original sequence, such as from , is a substring. The substring is a refinement of the subsequence.

The list of all subsequences for the word "apple" would be "a", "ap", "al", "ae", "app", "apl", "ape", "ale", "appl", "appe", "aple", "apple", "p", "pp", "pl", "pe", "ppl", "ppe", "ple", "pple", "l", "le", "e", "" (empty string).

Common subsequence

Given two sequences X and Y, a sequence Z is said to be a common subsequence of X and Y, if Z is a subsequence of both X and Y. For example, if

and
and

then is said to be a common subsequence of X and Y.

This would not be the longest common subsequence , since Z only has length 3, and the common subsequence has length 4. The longest common subsequence of X and Y is .

Applications

Subsequences have applications to computer science, [1] especially in the discipline of bioinformatics, where computers are used to compare, analyze, and store DNA, RNA, and protein sequences.

Take two sequences of DNA containing 37 elements, say:

SEQ1 = ACGGTGTCGTGCTATGCTGATGCTGACTTATATGCTA
SEQ2 = CGTTCGGCTATCGTACGTTCTATTCTATGATTTCTAA

The longest common subsequence of sequences 1 and 2 is:

LCS(SEQ1,SEQ2) = CGTTCGGCTATGCTTCTACTTATTCTA

This can be illustrated by highlighting the 27 elements of the longest common subsequence into the initial sequences:

SEQ1 = ACGGTGTCGTGCTATGCTGATGCTGACTTATATGCTA
SEQ2 = CGTTCGGCTATCGTACGTTCTATTCTATGATTTCTAA

Another way to show this is to align the two sequences, i.e., to position elements of the longest common subsequence in a same column (indicated by the vertical bar) and to introduce a special character (here, a dash) for padding of arisen empty subsequences:

SEQ1 = ACGGTGTCGTGCTAT-G--C-TGATGCTGA--CT-T-ATATG-CTA-
        | || ||| ||||| |  | |  | || |  || | || |  |||
SEQ2 = -C-GT-TCG-GCTATCGTACGT--T-CT-ATTCTATGAT-T-TCTAA

Subsequences are used to determine how similar the two strands of DNA are, using the DNA bases: adenine, guanine, cytosine and thymine.

Theorems

See also

Notes

  1. In computer science, string is often used as a synonym for sequence, but it is important to note that substring and subsequence are not synonyms. Substrings are consecutive parts of a string, while subsequences need not be. This means that a substring of a string is always a subsequence of the string, but a subsequence of a string is not always a substring of the string, see: Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. p. 4. ISBN   0-521-58519-8.

This article incorporates material from subsequence on PlanetMath, which is licensed under the Creative Commons Attribution/Share-Alike License.

Related Research Articles

In mathematics, more specifically in functional analysis, a Banach space is a complete normed vector space. Thus, a Banach space is a vector space with a metric that allows the computation of vector length and distance between vectors and is complete in the sense that a Cauchy sequence of vectors always converges to a well defined limit that is within the space.

Cauchy sequence A sequence of points that get progressively closer to each other

In mathematics, a Cauchy sequence, named after Augustin-Louis Cauchy, is a sequence whose elements become arbitrarily close to each other as the sequence progresses. More precisely, given any small positive distance, all but a finite number of elements of the sequence are less than that given distance from each other.

Inner product space Generalization of the dot product; used to defined Hilbert spaces

In mathematics, an inner product space or a Hausdorff pre-Hilbert space is a vector space with a binary operation called an inner product. This operation associates each pair of vectors in the space with a scalar quantity known as the inner product of the vectors, often denoted using angle brackets. Inner products allow the rigorous introduction of intuitive geometrical notions, such as the length of a vector or the angle between two vectors. They also provide the means of defining orthogonality between vectors. Inner product spaces generalize Euclidean spaces to vector spaces of any dimension, and are studied in functional analysis. Inner product spaces over the field of complex numbers are sometimes referred to as unitary spaces. The first usage of the concept of a vector space with an inner product is due to Giuseppe Peano, in 1898.

Riesz representation theorem, sometimes called Riesz–Fréchet representation theorem, named after Frigyes Riesz and Maurice René Fréchet, establishes an important connection between a Hilbert space and its continuous dual space. If the underlying field is the real numbers, the two are isometrically isomorphic; if the underlying field is the complex numbers, the two are isometrically anti-isomorphic. The (anti-) isomorphism is a particular natural one as will be described next; a natural isomorphism.

In mathematics, a presentation is one method of specifying a group. A presentation of a group G comprises a set S of generators—so that every element of the group can be written as a product of powers of some of these generators—and a set R of relations among those generators. We then say G has presentation

In mathematics, specifically ring theory, a principal ideal is an ideal in a ring that is generated by a single element of through multiplication by every element of The term also has another, similar meaning in order theory, where it refers to an (order) ideal in a poset generated by a single element which is to say the set of all elements less than or equal to in

Longest common subsequence problem

The longest common subsequence (LCS) problem is the problem of finding the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring problem: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The longest common subsequence problem is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

In mathematics, spectral theory is an inclusive term for theories extending the eigenvector and eigenvalue theory of a single square matrix to a much broader theory of the structure of operators in a variety of mathematical spaces. It is a result of studies of linear algebra and the solutions of systems of linear equations and their generalizations. The theory is connected to that of analytic functions because the spectral properties of an operator are related to analytic functions of the spectral parameter.

In mathematics, specifically in functional analysis, each bounded linear operator on a complex Hilbert space has a corresponding Hermitian adjoint. Adjoints of operators generalize conjugate transposes of square matrices to (possibly) infinite-dimensional situations. If one thinks of operators on a complex Hilbert space as generalized complex numbers, then the adjoint of an operator plays the role of the complex conjugate of a complex number.

In mathematics, the Lasker–Noether theorem states that every Noetherian ring is a Lasker ring, which means that every ideal can be decomposed as an intersection, called primary decomposition, of finitely many primary ideals. The theorem was first proven by Emanuel Lasker (1905) for the special case of polynomial rings and convergent power series rings, and was proven in its full generality by Emmy Noether (1921).

In mathematics, more specifically functional analysis and operator theory, the notion of unbounded operator provides an abstract framework for dealing with differential operators, unbounded observables in quantum mechanics, and other cases.

In mathematics, weak convergence in a Hilbert space is convergence of a sequence of points in the weak topology.

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y.

Hyperplane separation theorem On the existence of hyperplanes separating disjoint convex sets

In geometry, the hyperplane separation theorem is a theorem about disjoint convex sets in n-dimensional Euclidean space. There are several rather similar versions. In one version of the theorem, if both these sets are closed and at least one of them is compact, then there is a hyperplane in between them and even two parallel hyperplanes in between them separated by a gap. In another version, if both disjoint convex sets are open, then there is a hyperplane in between them, but not necessarily any gap. An axis which is orthogonal to a separating hyperplane is a separating axis, because the orthogonal projections of the convex bodies onto the axis are disjoint.

In functional analysis, the concept of a compact operator on Hilbert space is an extension of the concept of a matrix acting on a finite-dimensional vector space; in Hilbert space, compact operators are precisely the closure of finite-rank operators in the topology induced by the operator norm. As such, results from matrix theory can sometimes be extended to compact operators using similar arguments. By contrast, the study of general operators on infinite-dimensional spaces often requires a genuinely different approach.

In mathematics, the Hilbert projection theorem is a famous result of convex analysis that says that for every vector in a Hilbert space and every nonempty closed convex there exists a unique vector for which is minimized over the vectors ; that is, such that for every

In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space efficient version of the Needleman–Wunsch algorithm that uses divide and conquer. Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.

In mathematics, the Toda bracket is an operation on homotopy classes of maps, in particular on homotopy groups of spheres, named after Hiroshi Toda, who defined them and used them to compute homotopy groups of spheres in.

Hilbert space Mathematical generalization of Euclidean space to infinite dimensions

The mathematical concept of a Hilbert space, named after David Hilbert, generalizes the notion of Euclidean space. It extends the methods of vector algebra and calculus from the two-dimensional Euclidean plane and three-dimensional space to spaces with any finite or infinite number of dimensions. A Hilbert space is a vector space equipped with an inner product, an operation that allows lengths and angles to be defined. Furthermore, Hilbert spaces are complete, which means that there are enough limits in the space to allow the techniques of calculus to be used.

In machine learning and data mining, a string kernel is a kernel function that operates on strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intuitively understood as functions measuring the similarity of pairs of strings: the more similar two strings a and b are, the higher the value of a string kernel K(a, b) will be.