Pathfinder network

Last updated

A method for pruning dense networks to highlight key links

Contents

Rationale

Relationships among a set of elements are often represented as a square matrix with entries representing the relations between all pairs of the elements. Relations such as distances, dissimilarities, similarities, relatedness, correlations, co-occurrences, conditional probabilities, etc., can be represented by such matrices. Such data can also be represented as networks with weighted links between the elements. Such matrices and networks are extremely dense and are not easily apprehended without some form of data reduction or pruning.

A pathfinder network results from applying a pruning method that removes weaker links from a (usually dense) network according to the lengths of alternative paths (see below). [1] [2] [3] It is used as a psychometric scaling method based on graph theory and used in the study of expertise, education, [4] knowledge acquisition, mental models, [5] and knowledge engineering. It is also employed in generating communication networks, [6] software debugging, [7] visualizing scientific citation patterns, [8] information retrieval, and other forms of data visualization. [9] Pathfinder networks are potentially applicable to any problem addressed by network theory.

Overview

Network pruning aims to highlight the more important links between elements represented in a network. It helps to simplify the collection of connections involved which is valuable in data visualization and in comprehending essential relations among the elements represented in the network.

Several psychometric scaling methods start from pairwise data and yield structures revealing the underlying organization of the data. Data clustering and multidimensional scaling are two such methods. Network scaling represents another method based on graph theory. Pathfinder networks are derived from matrices of data for pairs of entities. Because the algorithm uses distances, similarity data are inverted to yield dissimilarities for the computations.

In the pathfinder network, the entities correspond to the nodes of the generated network, and the links in the network are determined by the patterns of proximities. For example, if the proximities are similarities, links will generally connect nodes of high similarity. When proximities are distances or dissimilarities, links will connect the shorter distances. The links in the network will be undirected if the proximities are symmetrical for every pair of entities. Symmetrical proximities mean that the order of the entities is not important, so the proximity of i and j is the same as the proximity of j and i for all pairs i,j. If the proximities are not symmetrical for every pair, the links will be directed.

Algorithm

The pathfinder algorithm uses two parameters.

  1. The parameter constrains the number of indirect proximities examined in generating the network. is an integer between and , inclusive where is the number of nodes or items. Shortest paths can have no more than links. When , all possible paths are included.
  2. The parameter defines the metric used for computing the distance of paths (cf. the Minkowski distance). is a real number between and , inclusive.

Path distance is computed as: , where is the distance of the link in the path and . For , is simply the sum of the distances of the links in the path. For , is the maximum of the distances of the links in the path because . A link is pruned if its distance is greater than the minimum distance of paths between the nodes connected by the link. Efficient methods for finding minimum distances include the Floyd–Warshall algorithm (for ) and Dijkstra's algorithm (for any value of ).

A network generated with particular values of and is called a . Both of the parameters have the effect of decreasing the number of links in the network as their values are increased. The network with the minimum number of links is obtained when and , i.e., .

With ordinal-scale data (see level of measurement), the parameter should be because the same would result from any positive monotonic transformation of the proximity data. Other values of require data measured on a ratio scale. The parameter can be varied to yield the desired number of links in the network or to focus on more local relations with smaller values of .

Essentially, pathfinder networks preserve the shortest possible paths given the data. Therefore, links are eliminated when they are not on shortest paths. The will be the minimum spanning tree for the links defined by the proximity data if a unique minimum spanning tree exists. In general, the includes all of the links in any minimum spanning tree.

Example

Here is an example of an undirected pathfinder network derived from average ratings of a group of biology graduate students. The students rated the relatedness of all pairs of the terms shown, and the mean rating for each pair was computed. The solid blue links are the (labeled "both" in the figure). The dotted red links are added in the . For the added links, there are no 2-link paths shorter than the link distance but there is at least one shorter path with more than two links in the data. A minimal spanning tree would have 24 links so the 26 links in implies that there is more than one minimum spanning tree. There are two cycles present so there are tied distances in the set of links in the cycle. Breaking each cycle would require removing one of the tied links in each cycle.

Bio2+.png

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> Machine learning paradigm

In machine learning, supervised learning (SL) is a paradigm where a model is trained using input objects and desired output values, which are often human-made labels. The training process builds a function that maps new data to expected output values. An optimal scenario will allow for the algorithm to accurately determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured via a generalization error.

<span class="mw-page-title-main">Shortest path problem</span> Computational problem of graph theory

In graph theory, the shortest path problem is the problem of finding a path between two vertices in a graph such that the sum of the weights of its constituent edges is minimized.

<span class="mw-page-title-main">Dijkstra's algorithm</span> Algorithm for finding shortest paths

Dijkstra's algorithm is an algorithm for finding the shortest paths between nodes in a weighted graph, which may represent, for example, a road network. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years later.

In computer science and optimization theory, the max-flow min-cut theorem states that in a flow network, the maximum amount of flow passing from the source to the sink is equal to the total weight of the edges in a minimum cut, i.e., the smallest total weight of the edges which if removed would disconnect the source from the sink.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

<span class="mw-page-title-main">Multidimensional scaling</span> Set of related ordination techniques used in information visualization

Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a data set. MDS is used to translate distances between each pair of objects in a set into a configuration of points mapped into an abstract Cartesian space.

<span class="mw-page-title-main">Maximum flow problem</span> Computational problem in graph theory

In optimization theory, maximum flow problems involve finding a feasible flow through a flow network that obtains the maximum possible flow rate.

<span class="mw-page-title-main">Cluster analysis</span> Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

<span class="mw-page-title-main">Euclidean minimum spanning tree</span> Shortest network connecting points

A Euclidean minimum spanning tree of a finite set of points in the Euclidean plane or higher-dimensional Euclidean space connects the points by a system of line segments with the points as endpoints, minimizing the total length of the segments. In it, any two points can reach each other along a path through the line segments. It can be found as the minimum spanning tree of a complete graph with the points as vertices and the Euclidean distances between points as edge weights.

<span class="mw-page-title-main">Centrality</span> Degree of connectedness within a graph

In graph theory and network analysis, indicators of centrality assign numbers or rankings to nodes within a graph corresponding to their network position. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, super-spreaders of disease, and brain networks. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

<span class="mw-page-title-main">X-ray reflectivity</span> Surface analytical technique

X-ray reflectivity is a surface-sensitive analytical technique used in chemistry, physics, and materials science to characterize surfaces, thin films and multilayers. It is a form of reflectometry based on the use of X-rays and is related to the techniques of neutron reflectometry and ellipsometry.

<span class="mw-page-title-main">Random geometric graph</span> In graph theory, the mathematically simplest spatial network

In graph theory, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space and connecting two nodes by a link if and only if their distance is in a given range, e.g. smaller than a certain neighborhood radius, r.

In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion, at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed, and marks as outliers points that lie alone in low-density regions . DBSCAN is one of the most commonly used and cited clustering algorithms.

<span class="mw-page-title-main">Smallest-circle problem</span> Finding the smallest circle that contains all given points

The smallest-circle problem is a computational geometry problem of computing the smallest circle that contains all of a given set of points in the Euclidean plane. The corresponding problem in n-dimensional space, the smallest bounding sphere problem, is to compute the smallest n-sphere that contains all of a given set of points. The smallest-circle problem was initially proposed by the English mathematician James Joseph Sylvester in 1857.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ().

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

In the data analysis of time series, Time Warp Edit Distance (TWED) is a measure of similarity between pairs of discrete time series, controlling the relative distortion of the time units of the two series using the physical notion of elasticity. In comparison to other distance measures,, TWED is a metric. Its computational time complexity is , but can be drastically reduced in some specific situations by using a corridor to reduce the search space. Its memory space complexity can be reduced to . It was first proposed in 2009 by P.-F. Marteau.

Semi-global matching (SGM) is a computer vision algorithm for the estimation of a dense disparity map from a rectified stereo image pair, introduced in 2005 by Heiko Hirschmüller while working at the German Aerospace Center. Given its predictable run time, its favourable trade-off between quality of the results and computing time, and its suitability for fast parallel implementation in ASIC or FPGA, it has encountered wide adoption in real-time stereo vision applications such as robotics and advanced driver assistance systems.

A central problem in algorithmic graph theory is the shortest path problem. One of the generalizations of the shortest path problem is known as the single-source-shortest-paths (SSSP) problem, which consists of finding the shortest paths from a source vertex to all other vertices in the graph. There are classical sequential algorithms which solve this problem, such as Dijkstra's algorithm. In this article, however, we present two parallel algorithms solving this problem.

References

  1. Schvaneveldt, R.W., ed. (1990). Pathfinder associative networks: Studies in knowledge organization. Norwood, NJ: Ablex. ISBN   978-0893916244.
  2. Schvaneveldt, R.W.; Durso, F.T.; Dearholt, D.W. (1989). "Network structures in proximity data". In Bower, G. (ed.). The psychology of learning and motivation: Advances in research and theory (PDF). Vol. 24. New York: Academic Press. pp. 249–284.
  3. Schvaneveldt, R.W.; Dearholt, D.W.; Durso, F.T. (1989). "Graph theoretic foundations of Pathfinder Networks" (PDF). Computers and Mathematics with Applications. 15 (4): 337–345. doi:10.1016/0898-1221(88)90221-0.
  4. Goldsmith, T.; Johnson, P.; Acton, W. (1991). "Assessing structural knowledge". Journal of Educational Psychology. 83 (4): 88–96. doi:10.1037/0022-0663.83.1.88.
  5. Kudikyala, U.K.; Vaughn, R.B. (2005). "Software requirement understanding using Pathfinder networks: discovering and evaluating mental models". Journal of Systems and Software. 74 (1): 101–108. doi:10.1016/j.jss.2003.09.028.
  6. Shope, S. M.; DeJoode, J. A.; Cooke, N. J.; Pedersen, H. (2004). "Using Pathfinder to Generate Communication Networks in a Cognitive Task Analysis". Proceedings of the Human Factors and Ergonomics Society Annual Meeting. 48 (3): 678–682. doi:10.1177/154193120404800386.
  7. Serrano, E.; Quinn, A.; Botia, J.A.; Cordón, O. (2010). "Debugging complex software systems by means of pathfinder networks". Information Science. 180 (5): 561–583. doi:10.1016/j.ins.2009.11.007.
  8. Chen, C.; Song (2017). "Science Mapping Tools and Applications.". Representing Scientific Knowledge. Springer. pp. 57–137.
  9. Cribbin, T. (2010). "Visualising the structure of document search results: A comparison of graph theoretic approaches". Information Visualization. 9 (2): 83–97. doi:10.1057/ivs.2009.3.

Other Information

Further information on pathfinder networks and several examples of the application of PFnets to a variety of problems can be found in the references.

Three papers describing fast implementations of pathfinder networks:

(The two variants by Quirin et al. are significantly faster. While the former can be applied with or and any value for , the latter can only be applied in cases where and .)