Graph edit distance

Last updated

In mathematics and computer science, graph edit distance (GED) is a measure of similarity (or dissimilarity) between two graphs. The concept of graph edit distance was first formalized mathematically by Alberto Sanfeliu and King-Sun Fu in 1983. [1] A major application of graph edit distance is in inexact graph matching, such as error-tolerant pattern recognition in machine learning. [2]

Contents

The graph edit distance between two graphs is related to the string edit distance between strings. With the interpretation of strings as connected, directed acyclic graphs of maximum degree one, classical definitions of edit distance such as Levenshtein distance, [3] [4] Hamming distance [5] and Jaro–Winkler distance may be interpreted as graph edit distances between suitably constrained graphs. Likewise, graph edit distance is also a generalization of tree edit distance between rooted trees. [6] [7] [8] [9]

Formal definitions and properties

The mathematical definition of graph edit distance is dependent upon the definitions of the graphs over which it is defined, i.e. whether and how the vertices and edges of the graph are labeled and whether the edges are directed. Generally, given a set of graph edit operations (also known as elementary graph operations), the graph edit distance between two graphs and , written as can be defined as

where denotes the set of edit paths transforming into (a graph isomorphic to) and is the cost of each graph edit operation .

The set of elementary graph edit operators typically includes:

vertex insertion to introduce a single new labeled vertex to a graph.
vertex deletion to remove a single (often disconnected) vertex from a graph.
vertex substitution to change the label (or color) of a given vertex.
edge insertion to introduce a new colored edge between a pair of vertices.
edge deletion to remove a single edge between a pair of vertices.
edge substitution to change the label (or color) of a given edge.

Additional, but less common operators, include operations such as edge splitting that introduces a new vertex into an edge (also creating a new edge), and edge contraction that eliminates vertices of degree two between edges (of the same color). Although such complex edit operators can be defined in terms of more elementary transformations, their use allows finer parameterization of the cost function when the operator is cheaper than the sum of its constituents.

A deep analysis of the elementary graph edit operators is presented in [10] [11] [12]

And some methods have been presented to automatically deduce these elementary graph edit operators. [13] [14] [15] [16] [17] And some algorithms learn these costs online: [18]

Applications

Graph edit distance finds applications in handwriting recognition, [19] fingerprint recognition [20] and cheminformatics. [21]

Algorithms and complexity

Exact algorithms for computing the graph edit distance between a pair of graphs typically transform the problem into one of finding the minimum cost edit path between the two graphs. The computation of the optimal edit path is cast as a pathfinding search or shortest path problem, often implemented as an A* search algorithm.

In addition to exact algorithms, a number of efficient approximation algorithms are also known. Most of them have cubic computational time [22] [23] [24] [25] [26]

Moreover, there is an algorithm that deduces an approximation of the GED in linear time [27]

Despite the above algorithms sometimes working well in practice, in general the problem of computing graph edit distance is NP-hard (for a proof that's available online, see Section 2 of Zeng et al.), and is even hard to approximate (formally, it is APX-hard [28] ).

Related Research Articles

<span class="mw-page-title-main">Minimum spanning tree</span> Least-weight tree connecting graph vertices

A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph has a minimum spanning forest, which is a union of the minimum spanning trees for its connected components.

<span class="mw-page-title-main">Graph coloring</span> Methodic assignment of colors to elements of a graph

In graph theory, graph coloring is a special case of graph labeling; it is an assignment of labels traditionally called "colors" to elements of a graph subject to certain constraints. In its simplest form, it is a way of coloring the vertices of a graph such that no two adjacent vertices are of the same color; this is called a vertex coloring. Similarly, an edge coloring assigns a color to each edge so that no two adjacent edges are of the same color, and a face coloring of a planar graph assigns a color to each face or region so that no two faces that share a boundary have the same color.

<span class="mw-page-title-main">Spanning tree</span> Tree which includes all vertices of a graph

In the mathematical field of graph theory, a spanning treeT of an undirected graph G is a subgraph that is a tree which includes all of the vertices of G. In general, a graph may have several spanning trees, but a graph that is not connected will not contain a spanning tree. If all of the edges of G are also edges of a spanning tree T of G, then G is a tree and is identical to T.

<span class="mw-page-title-main">Graph (abstract data type)</span> Abstract data type in computer science

In computer science, a graph is an abstract data type that is meant to implement the undirected graph and directed graph concepts from the field of graph theory within mathematics.

<span class="mw-page-title-main">Vertex cover</span> Subset of a graphs vertices, including at least one endpoint of every edge

In graph theory, a vertex cover of a graph is a set of vertices that includes at least one endpoint of every edge of the graph.

In the mathematical discipline of graph theory, a matching or independent edge set in an undirected graph is a set of edges without common vertices. In other words, a subset of the edges is a matching if each vertex appears in at most one edge of that matching. Finding a matching in a bipartite graph can be treated as a network flow problem.

<span class="mw-page-title-main">Perfect graph</span> Graph with tight clique-coloring relation

In graph theory, a perfect graph is a graph in which the chromatic number equals the size of the maximum clique, both in the graph itself and in every induced subgraph. In all graphs, the chromatic number is greater than or equal to the size of the maximum clique, but they can be far apart. A graph is perfect when these numbers are equal, and remain equal after the deletion of arbitrary subsets of vertices.

<span class="mw-page-title-main">Edge coloring</span> Problem of coloring a graphs edges such that meeting edges do not match

In graph theory, a proper edge coloring of a graph is an assignment of "colors" to the edges of the graph so that no two incident edges have the same color. For example, the figure to the right shows an edge coloring of a graph by the colors red, blue, and green. Edge colorings are one of several different types of graph coloring. The edge-coloring problem asks whether it is possible to color the edges of a given graph using at most k different colors, for a given value of k, or with the fewest possible colors. The minimum required number of colors for the edges of a given graph is called the chromatic index of the graph. For example, the edges of the graph in the illustration can be colored by three colors but cannot be colored by two colors, so the graph shown has chromatic index three.

<span class="mw-page-title-main">Euclidean minimum spanning tree</span> Shortest network connecting points

A Euclidean minimum spanning tree of a finite set of points in the Euclidean plane or higher-dimensional Euclidean space connects the points by a system of line segments with the points as endpoints, minimizing the total length of the segments. In it, any two points can reach each other along a path through the line segments. It can be found as the minimum spanning tree of a complete graph with the points as vertices and the Euclidean distances between points as edge weights.

<span class="mw-page-title-main">Cograph</span> Graph formed by complementation and disjoint union

In graph theory, a cograph, or complement-reducible graph, or P4-free graph, is a graph that can be generated from the single-vertex graph K1 by complementation and disjoint union. That is, the family of cographs is the smallest class of graphs that includes K1 and is closed under complementation and disjoint union.

The art gallery problem or museum problem is a well-studied visibility problem in computational geometry. It originates from the following real-world problem:

"In an art gallery, what is the minimum number of guards who together can observe the whole gallery?"

<span class="mw-page-title-main">Kőnig's theorem (graph theory)</span> Theorem showing that maximum matching and minimum vertex cover are equivalent for bipartite graphs

In the mathematical area of graph theory, Kőnig's theorem, proved by Dénes Kőnig (1931), describes an equivalence between the maximum matching problem and the minimum vertex cover problem in bipartite graphs. It was discovered independently, also in 1931, by Jenő Egerváry in the more general case of weighted graphs.

In computer science, the Hopcroft–Karp algorithm is an algorithm that takes a bipartite graph as input and produces a maximum-cardinality matching as output — a set of as many edges as possible with the property that no two edges share an endpoint. It runs in time in the worst case, where is set of edges in the graph, is set of vertices of the graph, and it is assumed that . In the case of dense graphs the time bound becomes , and for sparse random graphs it runs in time with high probability.

Maximum cardinality matching is a fundamental problem in graph theory. We are given a graph G, and the goal is to find a matching containing as many edges as possible; that is, a maximum cardinality subset of the edges such that each vertex is adjacent to at most one edge of the subset. As each edge will cover exactly two vertices, this problem is equivalent to the task of finding a matching that covers as many vertices as possible.

In graph theory, a branch of mathematics, a skew-symmetric graph is a directed graph that is isomorphic to its own transpose graph, the graph formed by reversing all of its edges, under an isomorphism that is an involution without any fixed points. Skew-symmetric graphs are identical to the double covering graphs of bidirected graphs.

In computer science, lexicographic breadth-first search or Lex-BFS is a linear time algorithm for ordering the vertices of a graph. The algorithm is different from a breadth-first search, but it produces an ordering that is consistent with breadth-first search.

In graph theory, the blossom algorithm is an algorithm for constructing maximum matchings on graphs. The algorithm was developed by Jack Edmonds in 1961, and published in 1965. Given a general graph G = (V, E), the algorithm finds a matching M such that each vertex in V is incident with at most one edge in M and |M| is maximized. The matching is constructed by iteratively improving an initial empty matching along augmenting paths in the graph. Unlike bipartite matching, the key new idea is that an odd-length cycle in the graph (blossom) is contracted to a single vertex, with the search continuing iteratively in the contracted graph.

In computer science, the method of contraction hierarchies is a speed-up technique for finding the shortest-path in a graph. The most intuitive applications are car-navigation systems: a user wants to drive from to using the quickest possible route. The metric optimized here is the travel time. Intersections are represented by vertices, the road sections connecting them by edges. The edge weights represent the time it takes to drive along this segment of the road. A path from to is a sequence of edges ; the shortest path is the one with the minimal sum of edge weights among all possible paths. The shortest path in a graph can be computed using Dijkstra's algorithm but, given that road networks consist of tens of millions of vertices, this is impractical. Contraction hierarchies is a speed-up method optimized to exploit properties of graphs representing road networks. The speed-up is achieved by creating shortcuts in a preprocessing phase which are then used during a shortest-path query to skip over "unimportant" vertices. This is based on the observation that road networks are highly hierarchical. Some intersections, for example highway junctions, are "more important" and higher up in the hierarchy than for example a junction leading into a dead end. Shortcuts can be used to save the precomputed distance between two important junctions such that the algorithm doesn't have to consider the full path between these junctions at query time. Contraction hierarchies do not know about which roads humans consider "important", but they are provided with the graph as input and are able to assign importance to vertices using heuristics.

In graph theory, a branch of mathematics, an indifference graph is an undirected graph constructed by assigning a real number to each vertex and connecting two vertices by an edge when their numbers are within one unit of each other. Indifference graphs are also the intersection graphs of sets of unit intervals, or of properly nested intervals. Based on these two types of interval representations, these graphs are also called unit interval graphs or proper interval graphs; they form a subclass of the interval graphs.

<span class="mw-page-title-main">Frankl–Rödl graph</span>

In graph theory and computational complexity theory, a Frankl–Rödl graph is a graph defined by connecting pairs of vertices of a hypercube that are at a specified even distance from each other. The graphs of this type are parameterized by the dimension of the hypercube and by the distance between adjacent vertices.

References

  1. Sanfeliu, Alberto; Fu, King-Sun (1983). "A distance measure between attributed relational graphs for pattern recognition". IEEE Transactions on Systems, Man, and Cybernetics . 13 (3): 353–363. doi:10.1109/TSMC.1983.6313167. S2CID   1087693.
  2. Gao, Xinbo; Xiao, Bing; Tao, Dacheng; Li, Xuelong (2010). "A survey of graph edit distance". Pattern Analysis and Applications. 13: 113–129. doi:10.1007/s10044-008-0141-y.
  3. Влади́мир И. Левенштейн (1965). Двоичные коды с исправлением выпадений, вставок и замещений символов[Binary codes capable of correcting deletions, insertions, and reversals]. Доклады Академий Наук СССР (in Russian). 163 (4): 845–848.
  4. Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady . 10 (8): 707–710.
  5. Hamming, Richard W. (1950). "Error detecting and error correcting codes" (PDF). Bell System Technical Journal . 29 (2): 147–160. doi:10.1002/j.1538-7305.1950.tb00463.x. hdl:10945/46756. MR   0035935. S2CID   61141773. Archived from the original on 2006-05-25.{{cite journal}}: CS1 maint: bot: original URL status unknown (link)
  6. Shasha, D; Zhang, K (1989). "Simple fast algorithms for the editing distance between trees and related problems". SIAM J. Comput. 18 (6): 1245–1262. CiteSeerX   10.1.1.460.5601 . doi:10.1137/0218082. S2CID   10970317.
  7. Zhang, K (1996). "A constrained edit distance between unordered labeled trees". Algorithmica . 15 (3): 205–222. doi:10.1007/BF01975866. S2CID   20043881.
  8. Bille, P (2005). "A survey on tree edit distance and related problems". Theor. Comput. Sci. 337 (1–3): 22–34. CiteSeerX   10.1.1.100.2577 . doi: 10.1016/j.tcs.2004.12.030 .
  9. Demaine, Erik D.; Mozes, Shay; Rossman, Benjamin; Weimann, Oren (2010). "An optimal decomposition algorithm for tree edit distance". ACM Transactions on Algorithms . 6 (1): A2. arXiv: cs/0604037 . CiteSeerX   10.1.1.163.6937 . doi:10.1145/1644015.1644017. MR   2654906. S2CID   7878119.
  10. Serratosa, Francesc (2021). Redefining the Graph Edit Distance. S. N. Computer Science, pp: 2-438.
  11. Serratosa, Francesc (2019). Graph edit distance: Restrictions to be a metric. Pattern Recognition, 90, pp: 250-256.
  12. Serratosa, Francesc; Cortés, Xavier (2015). Graph Edit Distance: moving from global to local structure to solve the graph-matching problem. Pattern Recognition Letters, 65, pp: 204-210.
  13. Santacruz, Pep; Serratosa, Francesc (2020). Learning the graph edit costs based on a learning model applied to sub-optimal graph matching. Neural Processing Letters, 51, pp: 881–904.
  14. Algabli, Shaima; Serratosa, Francesc (2018). Embedding the node-to-node mappings to learn the Graph edit distance parameters. Pattern Recognition Letters, 112, pp: 353-360.
  15. Xavier, Cortés; Serratosa, Francesc (2016). Learning Graph Matching Substitution Weights based on the Ground Truth Node Correspondence. International Journal of Pattern Recognition and Artificial Intelligence, 30(2), pp: 1650005 [22 pages].
  16. Xavier, Cortés; Serratosa, Francesc (2015). Learning Graph-Matching Edit-Costs based on the Optimality of the Oracle's Node Correspondences. Pattern Recognition Letters, 56, pp: 22 - 29.
  17. Conte, Donatello; Serratosa, Francesc (2020). Interactive Online Learning for Graph Matching using Active Strategies. Knowledge Based Systems, 105, pp: 106275.
  18. Rica, Elena; Álvarez, Susana; Serratosa, Francesc (2021). On-line learning the graph edit distance costs. Pattern Recognition Letters, 146, pp: 52-62.
  19. Fischer, Andreas; Suen, Ching Y.; Frinken, Volkmar; Riesen, Kaspar; Bunke, Horst (2013), "A Fast Matching Algorithm for Graph-Based Handwriting Recognition", Graph-Based Representations in Pattern Recognition, Lecture Notes in Computer Science, vol. 7877, pp. 194–203, doi:10.1007/978-3-642-38221-5_21, ISBN   978-3-642-38220-8
  20. Neuhaus, Michel; Bunke, Horst (2005), "A Graph Matching Based Approach to Fingerprint Classification using Directional Variance", Audio- and Video-Based Biometric Person Authentication, Lecture Notes in Computer Science, vol. 3546, pp. 191–200, doi:10.1007/11527923_20, ISBN   978-3-540-27887-0
  21. Birchall, Kristian; Gillet, Valerie J.; Harper, Gavin; Pickett, Stephen D. (Jan 2006). "Training Similarity Measures for Specific Activities: Application to Reduced Graphs". Journal of Chemical Information and Modeling . 46 (2): 557–586. doi:10.1021/ci050465e. PMID   16562986.
  22. Neuhaus, Michel; Bunke, Horst (Nov 2007). Bridging the Gap between Graph Edit Distance and Kernel Machines. Machine Perception and Artificial Intelligence. Vol. 68. World Scientific. ISBN   978-9812708175.
  23. Riesen, Kaspar (Feb 2016). Structural Pattern Recognition with Graph Edit Distance: Approximation Algorithms and Applications. Advances in Computer Vision and Pattern Recognition. Springer. ISBN   978-3319272511.
  24. Serratosa, Francesc (2014). Fast Computation of Bipartite Graph Matching. Pattern Recognition Letters, 45, pp: 244 - 250.
  25. Serratosa, Francesc (2015). Speeding up Fast Bipartite Graph Matching through a new cost matrix. International Journal of Pattern Recognition and Artificial Intelligence, 29 (2), 1550010, [17 pages].
  26. Serratosa, Francesc (2015). Computation of Graph Edit Distance: Reasoning about Optimality and Speed-up. Image and Vision Computing, 40, pp: 38-48.
  27. Santacruz, Pep; Serratosa, Francesc (2018). Error-tolerant graph matching in linear computational cost using an initial small partial matching. Pattern Recognition Letters.
  28. Lin, Chih-Long (1994-08-25). "Hardness of approximating graph transformation problem". In Du, Ding-Zhu; Zhang, Xiang-Sun (eds.). Algorithms and Computation. Lecture Notes in Computer Science. Vol. 834. Springer Berlin Heidelberg. pp. 74–82. doi:10.1007/3-540-58325-4_168. ISBN   9783540583257.