Tree alignment

Last updated January 19, 2022

In computational phylogenetics, tree alignment is a computational problem concerned with producing multiple sequence alignments, or alignments of three or more sequences of DNA, RNA, or protein. Sequences are arranged into a phylogenetic tree, modeling the evolutionary relationships between species or taxa. The edit distances between sequences are calculated for each of the tree's internal vertices, such that the sum of all edit distances within the tree is minimized. Tree alignment can be accomplished using one of several algorithms with various trade-offs between manageable tree size and computational effort.

Definition

Input: A set $S$ of sequences, a phylogenetic tree $T$ leaf-labeled by $S$ and an edit distance function $d$ between sequences.

Output: A labeling of the internal vertices of $T$ such that $\Sigma _{e\in T}d(e)$ is minimized, where $d(e)$ is the edit distance between the endpoints of $e$ .

The task is NP-hard.^[1]

Background

Sequence alignment

In bioinformatics, the basic method of information processing is to contrast the sequence data. Biologists use it to discover the function, structure, and evolutionary information in biological sequences. The following analyses are based on the sequence assembly: the phylogenetic analysis, the haplotype comparison, and the prediction of RNA structure. Therefore, the efficiency of sequence alignment will directly affect the efficacy of solving these problems. In order to design a rational and efficient sequence alignment, the algorithm derivation becomes an important branch of research in the field of bioinformatics.

Generally, sequence alignment means constructing a string from two or more given strings with the greatest similarity by adding letters, deleting letters, or adding a space for each string. The multiple sequence alignment problem is generally based on pairwise sequence alignment and currently, for a pairwise sequence alignment problem, biologists can use a dynamic programming approach to obtain its optimal solution. However, the multiple sequence alignment problem is still one of the more challenging problems in bioinformatics. This is because finding the optimal solution for multiple sequence alignment has been proven as an NP-complete problem and only an approximate optimal solution can be obtained.^[2]

Distance matrix method

Distance method measures the minimum operation number of character insertions, deletions, and substitutions that are required to transform one sequence u to the other sequence v when being operated on a pair of strings. The calculation of edit distance can be based on dynamic programming, and the equation is in O(|u|×|v|) time, where |u| and |v| are the lengths of u and v.^[3] The efficient estimation of edit distance is essential as Distance method is a basic principle in computational biology ^[4] For functions of hereditary properties "symmetrization" can be used. Due to a series of functions being used to calculate edit distance, different functions may result in different results. Finding an optimal edit distance function is essential for the tree alignment problem.

The problem of tree alignment

This figure indicates the growth rate about the exponential time, the polynomial time and the linear time Exponential.svg — This figure indicates the growth rate about the exponential time, the polynomial time and the linear time

Tree alignment results in a NP-hard problem, where scoring modes and alphabet sizes are restricted. It can be found as an algorithm, which is used to find the optimized solution. However, there is an exponential relationship between its efficiency and the number sequences, which means that when the length of the sequence is very large, the computation time required to get results is enormously long. Using star alignment to get the approximate optimized solution is faster than using tree alignment. However, whatever the degree of multiple-sequence similarity is, the time complexity of star alignment has a proportional relationship with the square of the sequence number and the square of the sequence average length. As usual, the sequence in MSA is so long that it is also inefficient or even unacceptable. Therefore, the challenge of reducing the time complexity to linear is one of the core issues in tree alignment.

Combinatorial optimization strategy

Combinatorial optimization is a good strategy to solve MSA problems. The idea of combinatorial optimization strategy is to transform the multiple sequence alignment into pair sequence alignment to solve this problem. Depending on its transformation strategy, the combinatorial optimization strategy can be divided into the tree alignment algorithm and the star alignment algorithm. For a given multi-sequence set $S$ ={ $s_{1}$ ,..., $s_{n}$ }, find an evolutionary tree which has n leaf nodes and establishing one to one relationship between this evolutionary tree and the set $S$ . By assigning the sequence to the internal nodes of the evolutionary tree, we calculate the total score of each edge, and the sum of all edges' scores is the score of the evolutionary tree. The aim of tree alignment is to find an assigned sequence, which can obtain a maximum score, and get the final matching result from the evolutionary tree and its nodes' assigned sequence. Star alignment can be seen as a special case of the tree alignment. When we use star alignment, the evolutionary tree has only one internal node and n leaf nodes. The sequence, which is assigned to the internal node, is called the core sequence.^[5]

The keyword tree theory and the Aho-Corasick search algorithm

When the combinatorial optimization strategy is used to transform the multiple sequence alignment into pair sequence alignment, the main problem is changed from "How to improve the efficiency of multiple sequence alignment" to "How to improve the efficiency of pairwise sequence alignment." The Keyword Tree Theory and the Aho-Corasick search algorithm is an efficient approach to solve the pairwise sequence alignment problem. The aim of combining the keyword tree theory and the Aho-Corasick search algorithm is to solve this kind of problem: for a given long string $T$ and a set of short strings $P$ ={ $p_{1}$ , $p_{2}$ ,... , $p_{z}$ } (z∈N,z>1), find the location of all $P_{i}$ in $T$ . The keyword tree produced by set $P$ is used, and then searched for in $T$ with this keyword tree through the Aho-Corasick search algorithm.^[6] The total time complexity of using this method to find all $P_{i}$ 's location in the T is O( $m$ + $n$ + $k$ ), where $m$ =| $T$ | (the length of $T$ ), $n$ =Σ| $P_{i}$ | (the sum of all $P_{i}$ 's lengths) and $k$ means the sum of occurrence for all $P_{i}$ in $T$ .

Keyword tree theory

The keyword tree of the set $P$ ={ $p_{1}$ , $p_{2}$ ,... , $p_{z}$ } (z∈N,z>1) is a rooted tree, whose root denoted by K, and this keyword tree satisfies:

(1): Each edge clearly demarcates one letter.

(2): Any two edges separated from the same node are to correspond to different letters.

(3) Each pattern $P_{i}$ (i=1,2,...,z) corresponds to a node $v$ , and the path from the root K to the node $v$ can exactly correctly spell the string $P_{i}$ .

For each leaf node of this K tree, it corresponds to one of the certain patterns of set $P$ .

$L(v)$ is used to represent the STRING which is connected from the root node to the node $v$ . $Lp(v)$ will then be used to represent the length of the longest suffix (also, this suffix is the prefix of one of patterns in the set $P$ ). Searching this prefix from the root node in the keyword tree, and the last node denoted by $n_{v}$ when the search is over.^{[ clarification needed ]}^[7]

For example, the set $P$ ={potato, tattoo, theater, other}, and the keyword tree is shown on the right. In that example, if $L(v)$ =potat, then $Lp(v)$ =|tat|=3, and the failure link of the node $v$ is shown in that figure.

Establishing a failure link is the key to improve the time complexity of the Aho-Corasick algorithm. It can be used to reduce the original polynomial time to the linear time for searching. Therefore, the core of keyword tree theory is to find all failure links (which also means finding all $n_{v}$ s) of a keyword tree in the linear time. It is assumed that every $n_{v}$ of all nodes $v$ , whose distance from the root node is less than or equal $k$ , can be found. The $n_{v}$ of the node $v$ whose distance from the root node is $k$ + 1 can then be sought for. Its parent node is $v'$ , and the letter represented by the node $v$ and $v'$ , is $x$ .

(1): If the next letter of the node $n_{v}'$ is $x$ , the other node of this edge can be set as $w$ , and $n_{v}$ = $w$ .

(2): If all letters are not $x$ by searching all edges between $n_{v}'$ and its child nodes, $L(n_{v})$ is a suffix of $L(n_{v}')$ plus $x$ . Because this suffix matches the STRING beginning with the root node (similar to prefix), the $x$ after $n_{v}'$ can be detected or not. If not, this process can be continued until $x$ or the root node is found.

Aho-Corasick search algorithm

After establishing all failure links in the keyword tree, the Aho-Corasick search algorithm is used to find the locations of all $P_{i}$ (i=1,2,...,z) in the linear time. In this step, the time complexity is O(m+k).

Other strategies

In MSA, DNA, RNA, and proteins, sequences are usually generated and they are assumed to have an evolutionary relationship. By comparing generated maps of RNA, DNA, and sequences from evolutionary families, people can assess conservation of proteins and find functional gene domains by comparing differences between evolutionary sequences. Generally, heuristic algorithms and tree alignment graphs are also adopted to solve multiple sequence alignment problems.

Heuristic algorithm

Generally, heuristic algorithms rely on the iterative strategy, which is to say that based on a comparison method, optimizing the results of multiple sequence alignment by the iterative process. Davie M. proposed using the particle swarm optimization algorithm to solve the multiple sequence alignment problem; Ikeda Takahiro proposed a heuristic algorithm which is based on the A* search algorithm; E. Birney first proposed using the hidden Markov model to solve the multiple sequence alignment problem; and many other biologists use the genetic algorithm to solve it.^[8]^[9] All these algorithms generally are robust and insensitive to the number of sequences, but they also have shortcomings. For example, the results from the particle swarm optimization algorithm are unstable and its merits depend on the selection of random numbers, the runtime of the A* search algorithm is too long, and the genetic algorithm is easy to fall into local excellent.^{[ clarification needed ]}

Tree alignment graph

Roughly, tree alignment graph aims to align trees into a graph and finally synthesize them to develop statistics. In biology, tree alignment graphs (TAGs) are used to remove the evolutionary conflicts or overlapping taxa from sets of trees and can then be queried to explore uncertainty and conflict. By integrating methods of aligning, synthesizing and analyzing, the TAG aims to solve the conflicting relationships and partial overlapping taxon sets obtained from a wide range of sequences. Also, the tree alignment graph serves as a fundamental approach for supertree and grafting exercises, which have been successfully tested to construct supertrees by Berry.^[10] Because the transformation from trees to a graph contain similar nodes and edges from their source trees, TAGs can also provide extraction of original source trees for further analysis. TAG is a combination of a set of aligning trees. It can store conflicting hypotheses evolutionary relationship and synthesize the source trees to develop evolutionary hypotheses. Therefore, it is a basic method to solve other alignment problems.^[11]

Related Research Articles

The travelling salesman problem asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" It is an NP-hard problem in combinatorial optimization, important in theoretical computer science and operations research.

In graph theory, the shortest path problem is the problem of finding a path between two vertices in a graph such that the sum of the weights of its constituent edges is minimized.

Breadth-first search Algorithm for searching the nodes of a graph in order by their hop count from a starting node

Breadth-first search (BFS) is an algorithm for searching a tree data structure for a node that satisfies a given property. It starts at the tree root and explores all nodes at the present depth prior to moving on to the nodes at the next depth level. Extra memory, usually a queue, is needed to keep track of the child nodes that were encountered but not yet explored.

A phylogenetic tree is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. All life on Earth is part of a single phylogenetic tree, indicating common ancestry.

In graph theory, a tree decomposition is a mapping of a graph into a tree that can be used to define the treewidth of the graph and speed up solving certain computational problems on the graph.

In computer science, the Aho–Corasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick in 1975. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings within an input text. It matches all strings simultaneously. The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches.

In computer science, a binary decision diagram (BDD) or branching program is a data structure that is used to represent a Boolean function. On a more abstract level, BDDs can be considered as a compressed representation of sets or relations. Unlike other compressed representations, operations are performed directly on the compressed representation, i.e. without decompression.

In computer science and operations research, the ant colony optimization algorithm (ACO) is a probabilistic technique for solving computational problems which can be reduced to finding good paths through graphs. Artificial ants stand for multi-agent methods inspired by the behavior of real ants. The pheromone-based communication of biological ants is often the predominant paradigm used. Combinations of artificial ants and local search algorithms have become a method of choice for numerous optimization tasks involving some sort of graph, e.g., vehicle routing and internet routing.

Belief propagation, also known as sum-product message passing, is a message-passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence and information theory and has demonstrated empirical success in numerous applications including low-density parity-check codes, turbo codes, free energy approximation, and satisfiability.

In computational complexity theory and computability theory, a search problem is a type of computational problem represented by a binary relation. If R is a binary relation such that field(R) ⊆ Γ⁺ and T is a Turing machine, then T calculates R if:

Estimation of distribution algorithms (EDAs), sometimes called probabilistic model-building genetic algorithms (PMBGAs), are stochastic optimization methods that guide the search for the optimum by building and sampling explicit probabilistic models of promising candidate solutions. Optimization is viewed as a series of incremental updates of a probabilistic model, starting with the model encoding an uninformative prior over admissible solutions and ending with the model that generates only the global optima.

Computational phylogenetics is the application of computational algorithms, methods, and programs to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes, species, or other taxa. For example, these techniques have been used to explore the family tree of hominid species and the relationships between specific genes shared by many types of organisms.

A zero-suppressed decision diagram is a particular kind of binary decision diagram (BDD) with fixed variable ordering. This data structure provides a canonically compact representation of sets, particularly suitable for certain combinatorial problems. Recall the Ordered Binary Decision Diagram (OBDD) reduction strategy, i.e. a node is removed from the decision tree if both out-edges point to the same node. In contrast, a node in a ZDD is removed if its positive edge points to the terminal node 0. This provides an alternative strong normal form, with improved compression of sparse sets. It is based on a reduction rule devised by Shin-ichi Minato in 1993.

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

In graph theory and computer science, the lowest common ancestor (LCA) of two nodes $v$ and $w$ in a tree or directed acyclic graph (DAG) $T$ is the lowest node that has both $v$ and $w$ as descendants, where we define each node to be a descendant of itself.

In mathematics and computer science, an unrooted binary tree is an unrooted tree in which each vertex has either one or three neighbors.

In graph theory, the modular decomposition is a decomposition of a graph into subsets of vertices called modules. A module is a generalization of a connected component of a graph. Unlike connected components, however, one module can be a proper subset of another. Modules therefore lead to a recursive (hierarchical) decomposition of the graph, instead of just a partition.

In the field of computational biology, a planted motif search (PMS) also known as a (l, d)-motif search (LDMS) is a method for identifying conserved motifs within a set of nucleic acid or peptide sequences.

A central problem in algorithmic graph theory is the shortest path problem. Hereby, the problem of finding the shortest path between every pair of nodes is known as all-pair-shortest-paths (APSP) problem. As sequential algorithms for this problem often yield long runtimes, parallelization has shown to be beneficial in this field. In this article two efficient algorithms solving this problem are introduced.

References

↑ Elias, Isaac (2006), "Settling the intractability of multiple alignment", J Comput Biol, 13 (7): 1323–1339, CiteSeerX 10.1.1.6.256 , doi:10.1089/cmb.2006.13.1323, PMID 17037961
↑ L Wang, T Jiang. On the complexity of multiple sequence alignment[J]. Journal of Computational Biology, 194,1(4):337— 34.
↑ Yen Hung Chen, On the bottleneck tree alignment problems, INFORMATION SCIENCES; JUN 1, 2010; 180; 11; p2134-p2141
↑ Ostrovsky, Rafail; Rabani, Yuval (2007-10-01). "Low distortion embeddings for edit distance". Journal of the ACM. Association for Computing Machinery (ACM). 54 (5): 23–es. doi:10.1145/1284320.1284322. ISSN 0004-5411. S2CID 15341956.
↑ Serafim Batzoglou. The many faces of sequence alignment[J]. Briefings in Bioinformatics. 2005,6(1):6—22
↑ Aho A V, Corasick M J. Efficient string matching: an aid to bibliographic search[J]. Communications of ACM, 1975,18(6): 333—340 ^{[ permanent dead link ]}.
↑ D Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology[M]. Cambridge: Cambridge University Press.1997.
↑ RobertC Edgar, Serafim Batzoglou. Multiple sequence alignment[J]. Current Opinion in Structural Biology. 2006,16(3):368— 373 Archived 2013-10-23 at the Wayback Machine .
↑ Notredame C, Higgins D.G. SAGA:sequence alignment by genetic algorithm [J]. Nucleic Acids Research. 1996,24(8):1515-1524.
↑ Wilkinson M, Pisani D, Measuring support and finding unsupported relationships in supertrees, Systematic Biology 54:823-831.
↑ Stephen A. Smith, Joseph W. Brown, analyzing and synthesizing phylogenies using tree alignment graphs, PLoS Computational Biology 9(9).

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Elias, Isaac (2006), "Settling the intractability of multiple alignment", J Comput Biol, 13 (7): 1323–1339, CiteSeerX 10.1.1.6.256 , doi:10.1089/cmb.2006.13.1323, PMID 17037961

[2] L Wang, T Jiang. On the complexity of multiple sequence alignment[J]. Journal of Computational Biology, 194,1(4):337— 34.

[3] Yen Hung Chen, On the bottleneck tree alignment problems, INFORMATION SCIENCES; JUN 1, 2010; 180; 11; p2134-p2141

[4] Ostrovsky, Rafail; Rabani, Yuval (2007-10-01). "Low distortion embeddings for edit distance". Journal of the ACM. Association for Computing Machinery (ACM). 54 (5): 23–es. doi:10.1145/1284320.1284322. ISSN 0004-5411. S2CID 15341956.

[5] Serafim Batzoglou. The many faces of sequence alignment[J]. Briefings in Bioinformatics. 2005,6(1):6—22

[6] Aho A V, Corasick M J. Efficient string matching: an aid to bibliographic search[J]. Communications of ACM, 1975,18(6): 333—340 ^{[ permanent dead link ]}.

[7] D Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology[M]. Cambridge: Cambridge University Press.1997.

[8] RobertC Edgar, Serafim Batzoglou. Multiple sequence alignment[J]. Current Opinion in Structural Biology. 2006,16(3):368— 373 Archived 2013-10-23 at the Wayback Machine .

[9] Notredame C, Higgins D.G. SAGA:sequence alignment by genetic algorithm [J]. Nucleic Acids Research. 1996,24(8):1515-1524.

[10] Wilkinson M, Pisani D, Measuring support and finding unsupported relationships in supertrees, Systematic Biology 54:823-831.

[11] Stephen A. Smith, Joseph W. Brown, analyzing and synthesizing phylogenies using tree alignment graphs, PLoS Computational Biology 9(9).

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]