Double Cut and Join Model

Last updated April 16, 2020

The double cut and join (DCJ) model is a model of genome rearrangement used to define an edit distance between genomes based on gene order and orientation rather than nucleotide sequence. It takes the fundamental units of a genome to be synteny blocks, maximal sections of DNA conserved between genomes. It focuses on changes due to genome rearrangement operations such as inversions and translocations, as well as the creation and absorption of circular intermediates.^[1]^[2]

A genome is described as a directed, edge-labeled graph in which each vertex has degree 1 or 2. Edges are labeled with synteny blocks, vertices of degree 1 represent telomeres, and vertices of degree 2 represent adjacencies between blocks. It follows that every connected component of the genome is a cycle or a path; each component is called a chromosome. The beginning of each edge is called the tail and the end is called the head; together, heads and tails are known as extremities. Vertices are described by their roles as heads and tails of blocks: for instance, in the figure, the adjacency formed by the head of marker 1 and the tail of marker 2 is labelled (h1, t2), and the telomere formed by the head of marker 2 is (h2). A double cut and join (DCJ) operation consists of one of the following four transformations:

(i) breaking two adjacencies (a, b) and (c, d) to create two new adjacencies, (a, c) and (b, d);
(ii) taking an adjacency (a, b) and a telomere (c) to create a new adjacency and a new telomere, either (a, c) and (b), or (b, c) and (a);
(iii) taking two telomeres (a) and (b) and creating a new adjacency (a, b);
(iv) breaking an adjacency (a, b) to create the two telomeres (a) and (b).
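
The four operations can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the cited papers: it assumes a genome is stored as a set of vertices, each vertex being a frozenset of extremity names such as "h1" or "t2" (two names for an adjacency, one for a telomere). All function names are hypothetical.

```python
def adjacency(a, b):
    """An adjacency is an unordered pair of extremities."""
    return frozenset((a, b))


def telomere(a):
    """A telomere is a single extremity."""
    return frozenset((a,))


def _replace(genome, old, new):
    """Swap the vertices in `old` for those in `new`, checking that `old` exists."""
    missing = old - set(genome)
    if missing:
        raise ValueError(f"vertices not in genome: {missing}")
    return (set(genome) - old) | new


def two_adjacencies(genome, a, b, c, d):
    """Operation (i): (a, b), (c, d) -> (a, c), (b, d)."""
    return _replace(genome, {adjacency(a, b), adjacency(c, d)},
                    {adjacency(a, c), adjacency(b, d)})


def adjacency_and_telomere(genome, a, b, c):
    """Operation (ii): (a, b), (c) -> (a, c), (b)."""
    return _replace(genome, {adjacency(a, b), telomere(c)},
                    {adjacency(a, c), telomere(b)})


def join_telomeres(genome, a, b):
    """Operation (iii): (a), (b) -> (a, b)."""
    return _replace(genome, {telomere(a), telomere(b)}, {adjacency(a, b)})


def split_adjacency(genome, a, b):
    """Operation (iv): (a, b) -> (a), (b)."""
    return _replace(genome, {adjacency(a, b)}, {telomere(a), telomere(b)})


# One linear chromosome carrying blocks 1 and 2 in that order.
linear = {telomere("t1"), adjacency("h1", "t2"), telomere("h2")}
# Operation (iii) closes it into a circular chromosome.
circular = join_telomeres(linear, "h2", "t1")
# Operation (iv) reopens the circle at a different adjacency.
reopened = split_adjacency(circular, "h1", "t2")
```

Because adjacencies are unordered, the alternative outcomes of (i) and (ii) are obtained by passing the extremities in a different order rather than through a separate flag.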

An edit distance, the double cut and join distance, is defined between genomes with the same set of edges. For two such genomes $G_{1}$ and $G_{2}$, the distance $d_{DCJ}(G_{1},G_{2})$ is the minimum number of DCJ operations needed to transform $G_{1}$ into $G_{2}$.

Mathematical Results

The DCJ distance defines a metric space. To verify this, we first note that $d_{DCJ}(G,G)=0$, since no operations are needed to change $G$ into itself, and that if $G_{1}\neq G_{2}$, then $d_{DCJ}(G_{1},G_{2})>0$, since at least one operation is needed to transform $G_{1}$ into any other genome. (A proof that $d_{DCJ}(G_{1},G_{2})$ is always defined whenever $G_{1}$ and $G_{2}$ are genomes with the same edges will follow.) Each operation has an inverse: an operation of type (i) or (ii) is undone by another operation of the same type, and (iii) is the inverse of (iv). Reversing a shortest sequence of operations from $G_{1}$ to $G_{2}$ therefore gives a sequence of the same length from $G_{2}$ to $G_{1}$, so $d_{DCJ}(G_{1},G_{2})=d_{DCJ}(G_{2},G_{1})$. The triangle inequality $d_{DCJ}(G_{1},G_{3})\leq d_{DCJ}(G_{1},G_{2})+d_{DCJ}(G_{2},G_{3})$ holds because a series of DCJ operations transforming $G_{1}$ into $G_{2}$, followed by a series of operations transforming $G_{2}$ into $G_{3}$, transforms $G_{1}$ into $G_{3}$, so the minimal number of operations needed to transform $G_{1}$ into $G_{3}$ can be no greater than this sum.

To compute the DCJ distance between two genomes $G_{1}$ and $G_{2}$ with the same set of synteny blocks, we construct a bipartite multigraph known as the adjacency graph $A=(V(G_{1}),V(G_{2}),E)$ of the genomes. The two parts of the vertex set of the adjacency graph are $V(G_{1})$, the vertex set of $G_{1}$, and $V(G_{2})$, the vertex set of $G_{2}$. For $a\in V(G_{1})$ and $b\in V(G_{2})$, we have $(a,b)\in E$ if $a$ and $b$ share an extremity of the same synteny block. If $a$ and $b$ share two extremities, we add two edges between $a$ and $b$ to $A$.
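
As a rough illustration of this construction, the sketch below builds the edge list of the adjacency graph. It assumes (this representation is not from the source) that each genome is a set of frozensets of extremity names, with two names for an adjacency and one for a telomere, and that both genomes contain exactly the same extremities. The function name `adjacency_graph` is hypothetical.

```python
def adjacency_graph(g1, g2):
    """Return the edges of the adjacency graph as (vertex of g1, vertex of g2) pairs.

    One edge is emitted per shared extremity, so two vertices that share
    both of their extremities are joined by two parallel edges.
    """
    # Map every extremity to the unique vertex of g2 that contains it.
    vertex_in_g2 = {ext: v for v in g2 for ext in v}
    return [(u, vertex_in_g2[ext]) for u in g1 for ext in u]


# G1: one linear chromosome with blocks 1 and 2.
g1 = {frozenset({"t1"}), frozenset({"h1", "t2"}), frozenset({"h2"})}
# G2: two linear chromosomes with one block each.
g2 = {frozenset({"t1"}), frozenset({"h1"}), frozenset({"t2"}), frozenset({"h2"})}

edges = adjacency_graph(g1, g2)        # one edge per extremity: 2 * 2 = 4 edges
self_edges = adjacency_graph(g1, g1)   # comparing a genome with itself
```

In `self_edges` the adjacency (h1, t2) is paired with itself twice, which is the cycle of length 2 that appears when the two genomes agree on an adjacency.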

If $G_{1}=G_{2}$, the adjacency graph is composed entirely of paths of length 1, each connecting two telomeres, and cycles of length 2, each connecting two adjacencies. We can use this fact to calculate $d_{DCJ}$. Let $N$ be the number of synteny blocks in genomes $G_{1}$ and $G_{2}$, $C$ be the number of cycles in their adjacency graph, and $I$ be the number of paths with an odd number of edges in their adjacency graph. Then $d_{DCJ}(G_{1},G_{2})=N-(C+I/2)$. The proof shows that each DCJ operation can decrease $N-(C+I/2)$ by no more than 1, and that if $G_{1}\neq G_{2}$, there exists an operation that decreases $N-(C+I/2)$ by 1. This proves that $d_{DCJ}$ is always defined, and gives a method for its calculation. Since it is easy to count cycles and paths, $d_{DCJ}$ can be found in linear time.^[3]
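
The calculation can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation from the cited paper: genomes are sets of frozensets of extremity names (two names for an adjacency, one for a telomere), both genomes contain the same extremities, and $I$ is counted as the number of paths with an odd number of edges, which are exactly the paths whose two telomeres lie in different genomes. The function name `dcj_distance` is hypothetical.

```python
from collections import defaultdict


def dcj_distance(g1, g2):
    """Compute N - (C + I/2) from the adjacency graph of g1 and g2."""
    extremities = {ext for v in g1 for ext in v}
    if extremities != {ext for v in g2 for ext in v}:
        raise ValueError("genomes must contain the same synteny blocks")
    n_blocks = len(extremities) // 2  # each block has a head and a tail

    # Build the bipartite multigraph; nodes are tagged 1 or 2 by genome.
    vertex_in_g2 = {ext: v for v in g2 for ext in v}
    neighbours = defaultdict(list)
    for u in g1:
        for ext in u:
            v = vertex_in_g2[ext]
            neighbours[(1, u)].append((2, v))
            neighbours[(2, v)].append((1, u))

    cycles = odd_paths = 0
    seen = set()
    for start in list(neighbours):
        if start in seen:
            continue
        # Depth-first walk over one connected component.
        seen.add(start)
        stack = [start]
        telomere_sides = []
        while stack:
            node = stack.pop()
            side, vertex = node
            if len(vertex) == 1:  # a telomere is an end of a path
                telomere_sides.append(side)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if not telomere_sides:
            cycles += 1  # no telomere: the component is a cycle
        elif telomere_sides[0] != telomere_sides[1]:
            odd_paths += 1  # ends in different genomes: odd number of edges
    return n_blocks - (cycles + odd_paths // 2)


linear_12 = {frozenset({"t1"}), frozenset({"h1", "t2"}), frozenset({"h2"})}
# Same chromosome with block 2 inverted.
inverted = {frozenset({"t1"}), frozenset({"h1", "h2"}), frozenset({"t2"})}
```

Each vertex and edge is visited a constant number of times, which matches the linear running time mentioned above. The number of odd paths is always even, so the integer division loses nothing.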

References

1. Friedberg R, Darling AE, Yancopoulos S (2008). "Genome rearrangement by the double cut and join operation". Methods Mol Biol. 452: 385–416. doi:10.1007/978-1-60327-159-2_18. PMID 18566774.
2. Yancopoulos S, Attie O, Friedberg R (2005). "Efficient sorting of genomic permutations by translocation, inversion and block interchange". Bioinformatics. 21 (16): 3340–3346. doi:10.1093/bioinformatics/bti535. PMID 15951307.
3. Yancopoulos, Attie & Friedberg (2005).

