Hirschberg's algorithm

In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the Needleman–Wunsch algorithm that uses divide and conquer. [1] Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.

Algorithm information

Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment. BLAST and FASTA are suboptimal heuristics. If x and y are strings, where length(x) = n and length(y) = m, the Needleman–Wunsch algorithm finds an optimal alignment in O(nm) time, using O(nm) space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch algorithm, which still takes O(nm) time, but needs only O(min{n, m}) space and is much faster in practice. [2] One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence between two sequences, as is done by the common diff tool.

The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that: [3]

  1. one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix;
  2. if (Z, W) = NW(X, Y) is the optimal alignment of (X, Y), and X = X^l + X^r is an arbitrary partition of X, there exists a partition Y^l + Y^r of Y such that NW(X, Y) = NW(X^l, Y^l) + NW(X^r, Y^r).

Algorithm description

Xi denotes the i-th character of X, where 1 ≤ i ≤ length(X). Xi:j denotes a substring of size j - i + 1, ranging from the i-th to the j-th character of X. rev(X) is the reversed version of X.

X and Y are sequences to be aligned. Let x be a character from X, and y be a character from Y. We assume that Del(x), Ins(y) and Sub(x, y) are well-defined integer-valued functions. These functions represent the cost of deleting x, inserting y, and replacing x with y, respectively.
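
For concreteness, the code sketches in this article use Python. One possible choice of cost functions, matching the scores used in the worked example below (+2 for a match, -1 for a mismatch, -2 for a gap), might look like the following; the names are illustrative only:

def del_cost(x):
    # Cost of deleting character x of X (a gap in Y).
    return -2

def ins_cost(y):
    # Cost of inserting character y of Y (a gap in X).
    return -2

def sub_cost(x, y):
    # Cost of aligning x with y: reward a match, penalize a mismatch.
    return 2 if x == y else -1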

We define NWScore(X, Y), which returns the last line of the Needleman–Wunsch score matrix Score(i, j):

function NWScore(X, Y)
    Score(0, 0) = 0  // 2 * (length(Y) + 1) array
    for j = 1 to length(Y)
        Score(0, j) = Score(0, j - 1) + Ins(Yj)
    for i = 1 to length(X)  // Init array
        Score(1, 0) = Score(0, 0) + Del(Xi)
        for j = 1 to length(Y)
            scoreSub = Score(0, j - 1) + Sub(Xi, Yj)
            scoreDel = Score(0, j) + Del(Xi)
            scoreIns = Score(1, j - 1) + Ins(Yj)
            Score(1, j) = max(scoreSub, scoreDel, scoreIns)
        end
        // Copy Score[1] to Score[0]
        Score(0, :) = Score(1, :)
    end
    for j = 0 to length(Y)
        LastLine(j) = Score(1, j)
    return LastLine

Note that at any point, NWScore only requires the two most recent rows of the score matrix. Thus, NWScore can be implemented in O(length(Y)) space.
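
A straightforward Python sketch of NWScore, keeping only two rows of the score matrix as in the pseudocode above (the function name and parameter order are assumptions made for this sketch):

def nw_score(X, Y, del_, ins, sub):
    # prev holds the previous row of the Needleman-Wunsch score matrix.
    prev = [0] * (len(Y) + 1)
    for j in range(1, len(Y) + 1):
        prev[j] = prev[j - 1] + ins(Y[j - 1])
    for i in range(1, len(X) + 1):
        # curr is the row currently being filled in.
        curr = [0] * (len(Y) + 1)
        curr[0] = prev[0] + del_(X[i - 1])
        for j in range(1, len(Y) + 1):
            curr[j] = max(prev[j - 1] + sub(X[i - 1], Y[j - 1]),  # substitution
                          prev[j] + del_(X[i - 1]),               # deletion
                          curr[j - 1] + ins(Y[j - 1]))            # insertion
        prev = curr
    return prev  # the last line of the score matrix

Only two Python lists are alive at any time, mirroring the linear space bound.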

The Hirschberg algorithm follows:

function Hirschberg(X, Y)
    Z = ""
    W = ""
    if length(X) == 0
        for i = 1 to length(Y)
            Z = Z + '-'
            W = W + Yi
        end
    else if length(Y) == 0
        for i = 1 to length(X)
            Z = Z + Xi
            W = W + '-'
        end
    else if length(X) == 1 or length(Y) == 1
        (Z, W) = NeedlemanWunsch(X, Y)
    else
        xlen = length(X)
        xmid = length(X) / 2
        ylen = length(Y)

        ScoreL = NWScore(X1:xmid, Y)
        ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y))
        ymid = arg max ScoreL + rev(ScoreR)

        (Z, W) = Hirschberg(X1:xmid, Y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen)
    end
    return (Z, W)

In the context of observation (2), X1:xmid + Xxmid+1:xlen is a partition of X. Index ymid is computed such that Y1:ymid + Yymid+1:ylen is a partition of Y and ScoreL(ymid) + rev(ScoreR)(ymid) is maximal.
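
Putting the pieces together, a minimal Python sketch of the recursion might look like the following. It reuses nw_score and the cost functions from the sketches above; needleman_wunsch is a small full-matrix aligner with backtracking used only on the trivial base cases, and all names are illustrative rather than canonical:

def needleman_wunsch(X, Y, del_, ins, sub):
    # Quadratic-space alignment; only called here when len(X) <= 1 or len(Y) <= 1.
    n, m = len(X), len(Y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + del_(X[i - 1])
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + ins(Y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(X[i - 1], Y[j - 1]),
                          S[i - 1][j] + del_(X[i - 1]),
                          S[i][j - 1] + ins(Y[j - 1]))
    # Backtrack from the bottom-right corner to recover one optimal alignment.
    ax, ay, i, j = "", "", n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(X[i - 1], Y[j - 1]):
            ax, ay, i, j = X[i - 1] + ax, Y[j - 1] + ay, i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + del_(X[i - 1]):
            ax, ay, i = X[i - 1] + ax, "-" + ay, i - 1
        else:
            ax, ay, j = "-" + ax, Y[j - 1] + ay, j - 1
    return ax, ay

def hirschberg(X, Y, del_, ins, sub):
    # Returns (aligned X, aligned Y).
    if len(X) == 0:
        return "-" * len(Y), Y
    if len(Y) == 0:
        return X, "-" * len(X)
    if len(X) == 1 or len(Y) == 1:
        return needleman_wunsch(X, Y, del_, ins, sub)
    xmid = len(X) // 2
    score_l = nw_score(X[:xmid], Y, del_, ins, sub)
    score_r = nw_score(X[xmid:][::-1], Y[::-1], del_, ins, sub)
    # ymid is the arg max of ScoreL + rev(ScoreR).
    sums = [l + r for l, r in zip(score_l, score_r[::-1])]
    ymid = sums.index(max(sums))
    ax_l, ay_l = hirschberg(X[:xmid], Y[:ymid], del_, ins, sub)
    ax_r, ay_r = hirschberg(X[xmid:], Y[ymid:], del_, ins, sub)
    return ax_l + ax_r, ay_l + ay_r

The scoring passes dominate the running time, which stays O(nm), while only single rows of the score matrix are ever kept in memory.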

Example

Let X = AGTACGCA, Y = TATGC, Del(x) = Ins(y) = -2 for every character, and Sub(x, y) = +2 if x = y and -1 otherwise.

The optimal alignment is given by

 W = AGTACGCA
 Z = --TATGC-

Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:

         T   A   T   G   C
     0  -2  -4  -6  -8 -10
 A  -2  -1   0  -2  -4  -6
 G  -4  -3  -2  -1   0  -2
 T  -6  -2  -4   0  -2  -1
 A  -8  -4   0  -2  -1  -3
 C -10  -6  -2  -1  -3   1
 G -12  -8  -4  -3   1  -1
 C -14 -10  -6  -5  -1   3
 A -16 -12  -8  -7  -3   1

One starts with the top-level call Hirschberg(AGTACGCA, TATGC), which splits the first argument in half: AGTACGCA = AGTA + CGCA. The call to NWScore(AGTA, TATGC) produces the following matrix:

         T   A   T   G   C
     0  -2  -4  -6  -8 -10
 A  -2  -1   0  -2  -4  -6
 G  -4  -3  -2  -1   0  -2
 T  -6  -2  -4   0  -2  -1
 A  -8  -4   0  -2  -1  -3

Likewise, NWScore(rev(CGCA), rev(TATGC)) generates the following matrix:

         C   G   T   A   T
     0  -2  -4  -6  -8 -10
 A  -2  -1  -3  -5  -4  -6
 C  -4   0  -2  -4  -6  -5
 G  -6  -2   2   0  -2  -4
 C  -8  -4   0   1  -1  -3

Their last lines (after reversing the latter) and the sum of the two are, respectively,

 ScoreL      = [ -8  -4   0  -2  -1  -3]
 rev(ScoreR) = [ -3  -1   1   0  -4  -8]
 Sum         = [-11  -5   1  -2  -5 -11]

The maximum value, 1, appears at ymid = 2, producing the partition TATGC = TA + TGC.
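
Assuming the nw_score and cost-function sketches given earlier, this split can be reproduced in a few lines; the printed values should match the arrays above:

X, Y = "AGTACGCA", "TATGC"
xmid = len(X) // 2                       # 4, so AGTACGCA = AGTA + CGCA
score_l = nw_score(X[:xmid], Y, del_cost, ins_cost, sub_cost)
score_r = nw_score(X[xmid:][::-1], Y[::-1], del_cost, ins_cost, sub_cost)
sums = [l + r for l, r in zip(score_l, score_r[::-1])]
print(score_l)                # [-8, -4, 0, -2, -1, -3]
print(score_r[::-1])          # [-3, -1, 1, 0, -4, -8]
print(sums)                   # [-11, -5, 1, -2, -5, -11]
print(sums.index(max(sums)))  # 2, i.e. ymid = 2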

The entire Hirschberg recursion (which we omit for brevity) produces the following tree:

                 (AGTACGCA,TATGC)
                 /               \
         (AGTA,TA)               (CGCA,TGC)
          /     \                 /       \
     (AG, )   (TA,TA)        (CG,TG)     (CA,C)
               /   \          /   \
           (T,T) (A,A)    (C,T) (G,G)

The leaves of the tree contain the optimal alignment.
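
Under the same assumptions, running the hirschberg sketch on the example recovers an optimal alignment; ties between equally scoring alignments may be broken differently than in the tree above, but the total score is the same:

x_aligned, y_aligned = hirschberg("AGTACGCA", "TATGC", del_cost, ins_cost, sub_cost)
print(x_aligned)  # e.g. AGTACGCA
print(y_aligned)  # e.g. --TATGC-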


References

  1. Hirschberg's algorithm.
  2. "The Algorithm".
  3. Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM. 18 (6): 341–343. CiteSeerX 10.1.1.348.4774. doi:10.1145/360825.360861. MR 0375829. S2CID 207694727.