Hirschberg's algorithm

In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the Needleman–Wunsch algorithm that uses divide and conquer. [1] Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.

Algorithm information

Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment. BLAST and FASTA are suboptimal heuristics. If x and y are strings, where length(x) = n and length(y) = m, the Needleman–Wunsch algorithm finds an optimal alignment in O(nm) time, using O(nm) space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch algorithm, which still takes O(nm) time, but needs only O(min{n, m}) space and is much faster in practice. [2] One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence between two sets of data, such as with the common diff tool.

The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that: [3]

  1. one can compute the optimal alignment score by storing only the current and previous row of the Needleman–Wunsch score matrix;
  2. if (Z, W) = NW(X, Y) is the optimal alignment of (X, Y), and X = X^l + X^r is an arbitrary partition of X, then there exists a partition Y^l + Y^r of Y such that NW(X, Y) = NW(X^l, Y^l) + NW(X^r, Y^r).

Algorithm description

Xi denotes the i-th character of X, where 1 ≤ i ≤ length(X). Xi:j denotes a substring of size j − i + 1, ranging from the i-th to the j-th character of X. rev(X) is the reversed version of X.

X and Y are the sequences to be aligned. Let x be a character from X, and y be a character from Y. We assume that Del(x), Ins(y) and Sub(x, y) are well-defined integer-valued functions. These functions represent the cost of deleting x, inserting y, and replacing x with y, respectively.
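
These cost functions are parameters of the algorithm. As one concrete choice (the one used in the example section below; any integer-valued costs work, and the names ins, dele and sub are ours, mirroring Ins, Del and Sub), they could be written in Python as:

ins  = lambda c: -2                      # Ins: cost of inserting character c
dele = lambda c: -2                      # Del: cost of deleting character c
sub  = lambda a, b: 2 if a == b else -1  # Sub: cost of substituting a with b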

We define NWScore(X, Y), which returns the last line of the Needleman–Wunsch score matrix Score(i, j):

function NWScore(X, Y)
    Score(0, 0) = 0  // Score is a 2 * (length(Y) + 1) array
    for j = 1 to length(Y)
        Score(0, j) = Score(0, j - 1) + Ins(Yj)
    for i = 1 to length(X)
        Score(1, 0) = Score(0, 0) + Del(Xi)
        for j = 1 to length(Y)
            scoreSub = Score(0, j - 1) + Sub(Xi, Yj)
            scoreDel = Score(0, j) + Del(Xi)
            scoreIns = Score(1, j - 1) + Ins(Yj)
            Score(1, j) = max(scoreSub, scoreDel, scoreIns)
        end
        // Copy Score[1] to Score[0]
        Score(0, :) = Score(1, :)
    end
    for j = 0 to length(Y)
        LastLine(j) = Score(1, j)
    return LastLine

Note that at any point, NWScore requires only the two most recent rows of the score matrix, so it uses space linear in length(Y); choosing the shorter sequence as Y gives O(min{length(X), length(Y)}) space.
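
For concreteness, here is a minimal Python sketch of NWScore; the function and parameter names are ours, and the default costs are the ones assumed above. Only the two-row trick matters:

def nw_score(x, y, ins=lambda c: -2, dele=lambda c: -2,
             sub=lambda a, b: 2 if a == b else -1):
    # Last line of the Needleman-Wunsch score matrix, keeping two rows.
    prev = [0] * (len(y) + 1)
    for j in range(1, len(y) + 1):
        prev[j] = prev[j - 1] + ins(y[j - 1])          # first row: insertions only
    for i in range(1, len(x) + 1):
        curr = [prev[0] + dele(x[i - 1])] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(prev[j - 1] + sub(x[i - 1], y[j - 1]),  # substitution
                          prev[j] + dele(x[i - 1]),               # deletion
                          curr[j - 1] + ins(y[j - 1]))            # insertion
        prev = curr                                    # "copy Score[1] to Score[0]"
    return prev

For example, nw_score("AGTA", "TATGC") returns [-8, -4, 0, -2, -1, -3], which is the last line of the first NWScore matrix in the example below.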

The Hirschberg algorithm follows:

function Hirschberg(X, Y)
    Z = ""
    W = ""
    if length(X) == 0
        for i = 1 to length(Y)
            Z = Z + '-'
            W = W + Yi
        end
    else if length(Y) == 0
        for i = 1 to length(X)
            Z = Z + Xi
            W = W + '-'
        end
    else if length(X) == 1 or length(Y) == 1
        (Z, W) = NeedlemanWunsch(X, Y)
    else
        xlen = length(X)
        xmid = length(X) / 2
        ylen = length(Y)

        ScoreL = NWScore(X1:xmid, Y)
        ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y))
        ymid = arg max ScoreL + rev(ScoreR)

        (Z, W) = Hirschberg(X1:xmid, Y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen)
    end
    return (Z, W)

In the context of observation (2), X^l + X^r with X^l = X1:xmid is a partition of X. Index ymid is computed such that Y^l = Y1:ymid and Y^r = Yymid+1:length(Y).
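
Under the same assumptions as before, the recursion translates to the following Python sketch. It reuses nw_score from above; nw_align, a plain full-matrix Needleman–Wunsch with traceback, stands in for the NeedlemanWunsch base case (since it is only called when one string has length at most one, it too needs only linear space here):

def nw_align(x, y, ins, dele, sub):
    # Full Needleman-Wunsch with traceback; used only on tiny inputs.
    s = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        s[i][0] = s[i - 1][0] + dele(x[i - 1])
    for j in range(1, len(y) + 1):
        s[0][j] = s[0][j - 1] + ins(y[j - 1])
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s[i][j] = max(s[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),
                          s[i - 1][j] + dele(x[i - 1]),
                          s[i][j - 1] + ins(y[j - 1]))
    z, w, i, j = "", "", len(x), len(y)
    while i > 0 or j > 0:                      # trace back one optimal path
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub(x[i - 1], y[j - 1]):
            z, w, i, j = x[i - 1] + z, y[j - 1] + w, i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + dele(x[i - 1]):
            z, w, i = x[i - 1] + z, "-" + w, i - 1
        else:
            z, w, j = "-" + z, y[j - 1] + w, j - 1
    return z, w

def hirschberg(x, y, ins=lambda c: -2, dele=lambda c: -2,
               sub=lambda a, b: 2 if a == b else -1):
    if len(x) == 0:
        return "-" * len(y), y                 # X exhausted: insert the rest of Y
    if len(y) == 0:
        return x, "-" * len(x)                 # Y exhausted: delete the rest of X
    if len(x) == 1 or len(y) == 1:
        return nw_align(x, y, ins, dele, sub)  # trivial base case
    xmid = len(x) // 2
    score_l = nw_score(x[:xmid], y, ins, dele, sub)
    score_r = nw_score(x[xmid:][::-1], y[::-1], ins, dele, sub)
    # ymid = arg max of ScoreL + rev(ScoreR)
    ymid = max(range(len(y) + 1), key=lambda j: score_l[j] + score_r[len(y) - j])
    z_l, w_l = hirschberg(x[:xmid], y[:ymid], ins, dele, sub)
    z_r, w_r = hirschberg(x[xmid:], y[ymid:], ins, dele, sub)
    return z_l + z_r, w_l + w_r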

Example

Let

 X = AGTACGCA,
 Y = TATGC,
 Del(x) = Ins(y) = -2 for every character, and
 Sub(x, y) = +2 if x = y, and -1 if x ≠ y.

The optimal alignment is given by

 Z = AGTACGCA
 W = --TATGC-

Indeed, this can be verified by tracing back through the corresponding Needleman–Wunsch matrix:

        T   A   T   G   C
    0  -2  -4  -6  -8 -10
A  -2  -1   0  -2  -4  -6
G  -4  -3  -2  -1   0  -2
T  -6  -2  -4   0  -2  -1
A  -8  -4   0  -2  -1  -3
C -10  -6  -2  -1  -3   1
G -12  -8  -4  -3   1  -1
C -14 -10  -6  -5  -1   3
A -16 -12  -8  -7  -3   1

One starts with the top-level call Hirschberg(AGTACGCA, TATGC), which splits the first argument in half: AGTACGCA = AGTA + CGCA. The call NWScore(AGTA, Y) produces the following matrix:

        T   A   T   G   C
    0  -2  -4  -6  -8 -10
A  -2  -1   0  -2  -4  -6
G  -4  -3  -2  -1   0  -2
T  -6  -2  -4   0  -2  -1
A  -8  -4   0  -2  -1  -3

Likewise, NWScore(rev(CGCA), rev(Y)) generates the following matrix:

        C   G   T   A   T
    0  -2  -4  -6  -8 -10
A  -2  -1  -3  -5  -4  -6
C  -4   0  -2  -4  -6  -5
G  -6  -2   2   0  -2  -4
C  -8  -4   0   1  -1  -3

Their last lines (after reversing the latter) and their sum are, respectively:

 ScoreL      = [ -8 -4  0 -2 -1 -3 ]  rev(ScoreR) = [ -3 -1  1  0 -4 -8 ]  Sum         = [-11 -5  1 -2 -5 -11]

The maximum, Sum(2) = 1, appears at ymid = 2, producing the partition Y = TA + TGC.
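
This split can be reproduced with the nw_score sketch from above:

score_l = nw_score("AGTA", "TATGC")                    # [-8, -4, 0, -2, -1, -3]
score_r = nw_score("ACGC", "CGTAT")                    # rev("CGCA") vs rev("TATGC")
total = [a + b for a, b in zip(score_l, reversed(score_r))]
ymid = max(range(len(total)), key=total.__getitem__)   # ymid == 2, total[2] == 1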

The entire Hirschberg recursion (whose intermediate score computations we omit for brevity) produces the following tree:

                (AGTACGCA,TATGC)
                /              \
        (AGTA,TA)              (CGCA,TGC)
        /       \              /        \
   (AG,)     (TA,TA)      (CG,TG)     (CA,C)
             /     \       /    \
         (T,T)   (A,A)  (C,T)  (G,G)

The leaves of the tree contain the optimal alignment.
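
With the hirschberg sketch from above, the whole example is a single call; note that other implementations may break ties between equally scoring alignments differently. The same machinery also yields a longest common subsequence, as mentioned earlier, by scoring matches +1 and everything else 0:

z, w = hirschberg("AGTACGCA", "TATGC")
print(z)   # AGTACGCA
print(w)   # --TATGC-

# LCS: the matched columns of the 0/1-scored alignment spell out an LCS.
z, w = hirschberg("ABCBDAB", "BDCABA", ins=lambda c: 0, dele=lambda c: 0,
                  sub=lambda a, b: 1 if a == b else 0)
lcs = "".join(a for a, b in zip(z, w) if a == b)   # a length-4 LCS, e.g. "BCBA"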

See also

  * Needleman–Wunsch algorithm
  * Smith–Waterman algorithm
  * Levenshtein distance
  * Longest common subsequence


References

  1. Hirschberg's algorithm.
  2. "The Algorithm".
  3. Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM. 18 (6): 341–343. CiteSeerX 10.1.1.348.4774. doi:10.1145/360825.360861. MR 0375829. S2CID 207694727.