Sequence graph

Last updated

Sequence graph, also called an alignment graph, breakpoint graph, or adjacency graph, are bidirected graphs used in comparative genomics. The structure consists of multiple graphs or genomes with a series of edges and vertices represented as adjacencies between segments in a genome [1] and DNA segments respectively. Traversing a connected component of segments and adjacency edges (called a thread) yields a sequence, which typically represents a genome or a section of a genome. The segments can be thought of as synteny blocks, with the edges dictating how to arrange these blocks in a particular genome, and the labelling of the adjacency edges representing bases that are not contained in synteny blocks.

Contents

Construction

Before constructing a sequence graph, there must be at least two genomes represented as directed graphs with edges as threads (adjacency edges) and vertices as DNA segments. The genomes should be labeled P and Q, while the sequence graph is labeled as BreakpointGraph(P, Q). [2]

The directional vertices of Q and their edges are arranged in the order of P. Once completed, the edges of Q are reconnected to their original vertices. After all edges have been matched the vertex directions are removed and instead each vertex is labeled as vh (vertex head) and vt (vertex tail).

Similarity between genomes is represented by the number of cycles (independent systems) within the sequence graph. The number of cycles is equal to cycles (P, Q). The max number of cycles possible is equal to the number of vertices in the sequence graph.

Example

Figure example.

Upon receiving genomes P (+a +b -c) and Q (+a +b -c), [1] Q should be realigned to follow the direction edges (red) of P. The vertices should be renamed from a, b, c to ah at, bh bt, ch ct and the edges of P and Q should be connected to their original vertices (P edges = black, Q edges = green). Remove the directional edges (red). The number of cycles in G(P, Q) is 1 while the max possible is 3.

Applications

Reconstruction of ancestral genomes

Alekseyev and Pevzner use sequence graphs to create their own algorithm to study the genome rearrangement history of several mammals, as well as a way to overcome problems with current ancestral reconstruction of genomes. [1]

Multiple sequence alignment

Sequence graphs can be used to represent multiple sequence alignments with the addition of a new kind of edge representing homology between segments. [3] For a set of genomes, one can create an acyclic breakpoint graph with a thread for each genome. For two segments and , where ,,, and represent the endpoints of the two segments, homology edges can be created from to and to or from to and to - representing the two possible orientations of the homology. The advantage of representing a multiple sequence alignment this way is that it is possible to include inversions and other structural rearrangements that wouldn't be allowable in a matrix representation.

Representing variation

If there are multiple possible paths when traversing a thread in a sequence graph, multiple sequences can be represented by the same thread. This means it is possible to create a sequence graph that represents a population of individuals with slightly different genomes - with each genome corresponding to one path through the graph. These graphs have been proposed as a replacement for the reference human genome. [4]

Related Research Articles

<span class="mw-page-title-main">Depth-first search</span> Search algorithm

Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node and explores as far as possible along each branch before backtracking. Extra memory, usually a stack, is needed to keep track of the nodes discovered so far along a specified branch which helps in backtracking of the graph.

<span class="mw-page-title-main">Bipartite graph</span> Graph divided into two independent sets

In the mathematical field of graph theory, a bipartite graph is a graph whose vertices can be divided into two disjoint and independent sets and , that is, every edge connects a vertex in to one in . Vertex sets and are usually called the parts of the graph. Equivalently, a bipartite graph is a graph that does not contain any odd-length cycles.

In graph theory and computer science, an adjacency matrix is a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.

In graph theory, a branch of mathematics, the (binary) cycle space of an undirected graph is the set of its even-degree subgraphs.

This is a glossary of graph theory. Graph theory is the study of graphs, systems of nodes or vertices connected in pairs by lines or edges.

<span class="mw-page-title-main">Graph (discrete mathematics)</span> Vertices connected in pairs by edges

In discrete mathematics, and more specifically in graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects are represented by abstractions called vertices and each of the related pairs of vertices is called an edge. Typically, a graph is depicted in diagrammatic form as a set of dots or circles for the vertices, joined by lines or curves for the edges.

<span class="mw-page-title-main">Eulerian path</span> Trail in a graph that visits each edge once

In graph theory, an Eulerian trail is a trail in a finite graph that visits every edge exactly once. Similarly, an Eulerian circuit or Eulerian cycle is an Eulerian trail that starts and ends on the same vertex. They were first discussed by Leonhard Euler while solving the famous Seven Bridges of Königsberg problem in 1736. The problem can be stated mathematically like this:

In mathematics, an incidence matrix is a logical matrix that shows the relationship between two classes of objects, usually called an incidence relation. If the first class is X and the second is Y, the matrix has one row for each element of X and one column for each element of Y. The entry in row x and column y is 1 if x and y are related and 0 if they are not. There are variations; see below.

<span class="mw-page-title-main">Edge coloring</span> Problem of coloring a graphs edges such that meeting edges do not match

In graph theory, a proper edge coloring of a graph is an assignment of "colors" to the edges of the graph so that no two incident edges have the same color. For example, the figure to the right shows an edge coloring of a graph by the colors red, blue, and green. Edge colorings are one of several different types of graph coloring. The edge-coloring problem asks whether it is possible to color the edges of a given graph using at most k different colors, for a given value of k, or with the fewest possible colors. The minimum required number of colors for the edges of a given graph is called the chromatic index of the graph. For example, the edges of the graph in the illustration can be colored by three colors but cannot be colored by two colors, so the graph shown has chromatic index three.

In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge (u,v) from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. Precisely, a topological sort is a graph traversal in which each node v is visited only after all its dependencies are visited. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time. Topological sorting has many applications, especially in ranking problems such as feedback arc set. Topological sorting is possible even when the DAG has disconnected components.

<span class="mw-page-title-main">Circuit rank</span> Fewest graph edges whose removal breaks all cycles

In graph theory, a branch of mathematics, the circuit rank, cyclomatic number, cycle rank, or nullity of an undirected graph is the minimum number of edges that must be removed from the graph to break all its cycles, making it into a tree or forest. It is equal to the number of independent cycles in the graph. Unlike the corresponding feedback arc set problem for directed graphs, the circuit rank r is easily computed using the formula

In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has mn vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence. For a set of m symbols S = {s1, …, sm}, the set of vertices is:

In graph theory, reachability refers to the ability to get from one vertex to another within a graph. A vertex can reach a vertex if there exists a sequence of adjacent vertices which starts with and ends with .

<span class="mw-page-title-main">Cactus graph</span> Mathematical tree of cycles

In graph theory, a cactus is a connected graph in which any two simple cycles have at most one vertex in common. Equivalently, it is a connected graph in which every edge belongs to at most one simple cycle, or in which every block is an edge or a cycle.

<span class="mw-page-title-main">Directed graph</span> Graph with oriented edges

In mathematics, and more specifically in graph theory, a directed graph is a graph that is made up of a set of vertices connected by directed edges, often called arcs.

The Havel–Hakimi algorithm is an algorithm in graph theory solving the graph realization problem. That is, it answers the following question: Given a finite list of nonnegative integers in non-increasing order, is there a simple graph such that its degree sequence is exactly this list? A simple graph contains no double edges or loops. The degree sequence is a list of numbers in nonincreasing order indicating the number of edges incident to each vertex in the graph. If a simple graph exists for exactly the given degree sequence, the list of integers is called graphic. The Havel-Hakimi algorithm constructs a special solution if a simple graph for the given degree sequence exists, or proves that one cannot find a positive answer. This construction is based on a recursive algorithm. The algorithm was published by Havel (1955), and later by Hakimi (1962).

<span class="mw-page-title-main">Laves graph</span> Periodic spatial graph

In geometry and crystallography, the Laves graph is an infinite and highly symmetric system of points and line segments in three-dimensional Euclidean space, forming a periodic graph. Three equal-length segments meet at 120° angles at each point, and all cycles use ten or more segments. It is the shortest possible triply periodic graph, relative to the volume of its fundamental domain. One arrangement of the Laves graph uses one out of every eight of the points in the integer lattice as its points, and connects all pairs of these points that are nearest neighbors, at distance . It can also be defined, divorced from its geometry, as an abstract undirected graph, a covering graph of the complete graph on four vertices.

<span class="mw-page-title-main">Double Cut and Join Model</span>

The double cut and join (DCJ) model is a model for genome rearrangement used to define an edit distance between genomes based on gene order and orientation, rather than nucleotide sequence. It takes the fundamental units of a genome to be synteny blocks, maximal sections of DNA conserved between genomes. It focuses on changes due to genome rearrangement operations such as inversions, translocations as well as the creation and absorption of circular intermediates.

<span class="mw-page-title-main">Locally linear graph</span> Graph where every edge is in one triangle

In graph theory, a locally linear graph is an undirected graph in which every edge belongs to exactly one triangle. Equivalently, for each vertex of the graph, its neighbors are each adjacent to exactly one other neighbor, so the neighbors can be paired up into an induced matching. Locally linear graphs have also been called locally matched graphs. Their triangles form the hyperedges of triangle-free 3-uniform linear hypergraphs and the blocks of certain partial Steiner triple systems, and the locally linear graphs are exactly the Gaifman graphs of these hypergraphs or partial Steiner systems.

In algebraic topology and graph theory, graph homology describes the homology groups of a graph, where the graph is considered as a topological space. It formalizes the idea of the number of "holes" in the graph. It is a special case of a simplicial homology, as a graph is a special case of a simplicial complex. Since a finite graph is a 1-complex, the only non-trivial homology groups are the 0-th group and the 1-th group.

References

  1. 1 2 3 Alekseyev, M. A.; Pevzner, P. A. (2009-02-13). "Breakpoint graphs and ancestral genome reconstructions". Genome Research. 19 (5). Cold Spring Harbor Laboratory: 943–957. doi:10.1101/gr.082784.108. ISSN   1088-9051. PMC   2675983 . PMID   19218533.
  2. Breakpoint Graphs , retrieved 2022-05-05
  3. Paten, Benedict; Zerbino, Daniel R; Hickey, Glenn; Haussler, David (2014-06-19). "A unifying model of genome evolution under parsimony". BMC Bioinformatics . 15 (1). Springer Science and Business Media LLC: 206. doi: 10.1186/1471-2105-15-206 . ISSN   1471-2105. PMC   4082375 . PMID   24946830.
  4. Paten, Benedict; Novak, Adam; Haussler, David (2014-04-20). "Mapping to a Reference Genome Structure". arXiv: 1404.5010 [q-bio.GN].