Main path analysis

Last updated

Main path analysis is a mathematical tool, first proposed by Hummon and Doreian in 1989, [1] to identify the major paths in a citation network, which is one form of a directed acyclic graph (DAG). It has since become an effective technique for mapping technological trajectories, exploring scientific knowledge flows, and conducting literature reviews.

Contents

Main path analysis uncovers the most significant paths, or citation chains, in a citation network. The figure shows the global key-route main paths (in red) for a sample citation network (based on search path count and at key-route 1). Global key-route main paths for a citation network.svg
Main path analysis uncovers the most significant paths, or citation chains, in a citation network. The figure shows the global key-route main paths (in red) for a sample citation network (based on search path count and at key-route 1).

The method begins by measuring the significance of all the links in a citation network through the concept of ‘traversal count’ and then sequentially chains the most significant links into a "main path", which is deemed the most significant historical path in the target citation network. The method is applicable to any human activity that can be organized in the form of a citation network. The method is commonly applied to trace the knowledge flow paths or development trajectories of a science or technology field, through bibliographic citations or patent citations. [2] [3] [4] It has also been applied to judicial decisions to trace the evolving changes of legal opinions. [5] Main path analysis has attracted scholars attention recently. Academic research related to main path analysis saw a fast growing since 2007. A list of academic articles that introduce, explain, apply, modify, or extend the method originated in Hummon and Doreian [1] can be found here. Nevertheless, there are issues not broadly discussed in applying the method, including the handling of citation data, choosing a proper traversal weight scheme, search options, and interpretation of the resulting paths. [6]

History

Main path analysis is first proposed in Hummon and Doreian (1989) [1] in which they suggest a different approach for analyzing a citation network "where the connective threads through a network are preserved and the focus is on the links in the network rather than on the nodes." [1] They call the resulting chain of the most used citation links "main path" and claim that "It is our intuition that the main path, selected on the basis of the most used path will identify the main stream of a literature." The idea was verified using a set of DNA research articles. To make the method more practical, Liu and Lu (2012) [7] extends the method to include the key-route search. The most useful feature of the key-route search is that one is able to view the different level of main paths by adjusting the key-route numbers.

The method

Main path analysis operates in two steps. The first step obtains the traversal counts of each link in a citation network. Several types of traversal counts are mentioned in the literature. The second step searches for the main paths by linking the significant links according to the size of traversal counts. One needs to prepare a citation network before proceeding for main path analysis.

Preparing a citation network

It is necessary to prepare a citation network before starting main path analysis. In a citation network, the nodes represent the documents such as academic articles, patents, or legal cases. These nodes are connected using citation information. Citation networks are by nature directed because the two nodes on the opposite end of a link are not symmetrical in their roles. As regards to the direction, this article adopts the convention that the cited node points to the citing node, signifying the fact that knowledge in the cited node flows to the citing node. Citation network is also by nature acyclic, which means that a node can never chain back to itself if one moves along the links following their direction.

Several terms related to a citation network are defined here before proceeding further. Heads are the nodes the direction arrow leads to. Tails are the nodes on other ends of the direction arrow. Sources are the nodes that are cited but cite no others. Sinks cite other nodes but are not cited. Ancestors are the nodes that can be traced back to from a target node. Descendants are the nodes that one can reach from a target if one moves along the links following their direction.

Figure 1. SPC values for a sample citation network SPC values for a citation network.png
Figure 1. SPC values for a sample citation network

Traversal counts

Traversal counts measure the significance of a link. The literature discusses several types of traversal counts, including search path count (SPC), search path link count (SPLC), search path node pair (SPNP), and other variations. [8] All these traversal counts will be noted as SPX.

Figure 2. SPLC values for a sample citation network SPLC values for a citation network.png
Figure 2. SPLC values for a sample citation network

Search path count (SPC)

A link’s SPC is the number of times the link is traversed if one runs through all possible paths from all the sources to all the sinks. SPC is first proposed by Vladimir Batagelj. [9] SPC values for each link in a sample citation network is shown in Figure 1. The SPC value for the link (B, D) is 5 because five paths (B-D-F-H-K, B-D-F-I-L, B-D-F-I-M-N, B-D-I-L, and B-D-I-M-N) traverse through it.

Figure 3. SPNP values for a sample citation network SPNP values for a citation network.png
Figure 3. SPNP values for a sample citation network

A link’s SPLC is the number of times the link is traversed if one runs through all possible paths from all the ancestors of the tail node (including itself) to all the sinks. SPLC is first proposed by Hummon and Doreian. [1] Figure 2 presents the SPLC values for each link in the same citation network as shown in Figure 1. Six paths traverse through the link (D, F) thus give it the SPLC value 6. They are: B-D-F-H-K, B-D-F-I-L, B-D-F-I-M-N, D-F-H-K, D-F-I-L, and D-F-I-M-N, noting that all the paths begin either from the ancestor of D, which is B, and D itself.

Search path node pair (SPNP)

A link’s SPNP is the number of times the link is traversed if one runs through all possible paths from all the ancestors of the tail node (including itself) to all the descendants of the head node (including itself). SPNP is first proposed by Hummon and Doreian. [1] The SPNP values of the link (C, H) is 6 because there are 6 paths that begin from A, B, C (A and B are C's ancestors) and end at H and K (K is H's descendant). These paths are A-C-H, A-C-H-K, B-C-H, B-C-H-K, C-H, and C-H-K.

Figure 4. Local main paths in a sample citation network Local main paths SPC.png
Figure 4. Local main paths in a sample citation network

Based on the traversal counts, one can then search for the most significant path(s). There are several ways of finding them, including local, global, and key-route search.

Figure 5. Global main paths in a sample citation network Global main paths SPC.png
Figure 5. Global main paths in a sample citation network

Local search is mentioned in Hummon and Doreian [1] as "priority first" search. This search process always chooses the next link(s) with the highest SPX as the outgoing link. It keeps tracking the most traversed link(s) thus obtains the main stream among all citation chains. Figure 4 shows the local main paths that are obtained based on SPC. Noticing that when the search reaches the node I, two outgoing links have the same SPC values thus producing two paths afterward.

Figure 6. Local key-route main paths in a sample citation network Local key-route main paths SPC.png
Figure 6. Local key-route main paths in a sample citation network

Global search simply suggests the citation chain with the largest overall SPX. The concept of global search is similar to the critical path method in project scheduling. The global main paths of the sample citation network based on SPC is presented in Figure 5. The sum of all the SPC values in the path B-D-F-I-M-N is 15, which is the largest among all possible paths.

Figure 7. Global key-route main paths in a sample citation network Global key-route main paths SPC.png
Figure 7. Global key-route main paths in a sample citation network

Key-route search is designed to avoid the problem of missing significant links in both the local and global search. The problem is in the local and global main paths shown above, in which one of the most important links (H, K) is not included in the main paths. As described in Liu and Lu (2012), [7] the approach searches main paths from the specified links (key-routes) thus guarantees the inclusion of the links. One can also specify multiple links to obtain multiple main paths. An additional advantage of the key-route approach is that one is able to control the detail of the main paths by varying the number of key-routes. The larger the number of key-route is specified, the more detail is revealed. When the number of key-route increases to a certain point the search returns the whole citation network. Figure 6 and 7 show the local key-route and global key-route main paths of the sample citation network. In both main paths the number of key-route is set to 1, i.e., doing the search base on only the top links. Since there are two top links (B, D) and (H, K), the resulting main paths include both of them.

The Variants

In addition to the key-route search approach, variations of the method include the approach that is aggregative and stochastic, [10] considers decay in knowledge diffusion, [8] etc.

Applications

The method has been applied to three types of documenting system that maintain the tradition of making references to the previous documents. They are the academic article, patent, and judicial documenting system.

Academic article

Academic citation databases such as Web of Science and Scopus include comprehensive digitized citation information. These information make it possible to apply main path analysis to examine the knowledge structure or trace the knowledge flow of any scientific fields. Some early applications explores the subject of centrality-productivity, [11] conflict resolution, [12] etc. More recent applications include fullerenes, [4] nanotubes, [4] data envelopment analysis, [2] [13] [14] supply chain management, [15] corporate social responsibility, [16] IT outsourcing, [17] medical tourism, [18] etc.

Patent

Patents referencing prior arts is a common practice. For example, each United States patent document includes a "References Cited" section that lists the prior arts of the patent. Patent databases such as Clarivate Analytics and Webpat provide digitized patent citation information. Verspagen (2007) [3] and Mina (2007) [19] are the two early works that apply main path analysis to the patent data.

Judicial document

In the common law system, a court decision document usually references previously published opinions for the purpose of justifying the current decision. These judicial references, or legal citations, can also be used to construct citation networks and then tracing the changes of legal opinions. Research opportunity in this area is wide open. Liu et al. (2014) [5] conducted an exploratory study on such type of application.

Software Implementation

Main path analysis is implemented in Pajek, a widely used social network analysis software written by Vladimir Batagelj and Andrej Mrvar of University of Ljubljana, Slovenia. To run main path analysis in Pajek, one needs to first prepare a citation network and have Pajek reads in the network. Next, in the Pajek main menu, computes the traversal counts of all links in the network applying one of the following command sequences (depending on the choice of traversal counts).

Network → Acyclic Network → Create Weighted Network + Vector → Traversal Weights → Search Path Link Count (SPC), or

Network → Acyclic Network → Create Weighted Network + Vector → Traversal Weights → Search Path Link Count (SPLC), or

Network → Acyclic Network → Create Weighted Network + Vector → Traversal Weights → Search Path Node Pairs (SPNP)

After traversal counts are computed, the following command sequences find the main paths.

For local main paths

Network → Acyclic Network → Create (Sub)Network → Main Paths → Local Search → Forward

For global main paths

Network → Acyclic Network → Create (Sub)Network → Main Paths → Global Search → Standard

For local key-route main paths

Network → Acyclic Network → Create (Sub)Network → Main Paths → Local Search → Key-Route

For global key-route main paths

Network → Acyclic Network → Create (Sub)Network → Main Paths → Global Search → Key-Route

In addition to key-route search, a more flexible search feature is added starting from Pajek version 5.03 (January 4, 2018). The new feature allows for local and global search passing through vertices defined by a cluster. The command sequences are as follows:

Network → Acyclic Network → Create (Sub)Network → Main Paths → Local Search → Key-Route → Through Vertices in Cluster

Network → Acyclic Network → Create (Sub)Network → Main Paths → Global Search → Key-Route → Through Vertices in Cluster

Related Research Articles

Routing is the process of selecting a path for traffic in a network or between or across multiple networks. Broadly, routing is performed in many types of networks, including circuit-switched networks, such as the public switched telephone network (PSTN), and computer networks, such as the Internet.

<span class="mw-page-title-main">Semantic network</span> Knowledge base that represents semantic relations between concepts in a network

A semantic network, or frame network is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts, mapping or connecting semantic fields. A semantic network may be instantiated as, for example, a graph database or a concept map. Typical standardized semantic networks are expressed as semantic triples.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

<span class="mw-page-title-main">Dijkstra's algorithm</span> Graph search algorithm

Dijkstra's algorithm is an algorithm for finding the shortest paths between nodes in a weighted graph, which may represent, for example, road networks. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years later.

A* is a graph traversal and path search algorithm, which is used in many fields of computer science due to its completeness, optimality, and optimal efficiency. One major practical drawback is its space complexity, as it stores all generated nodes in memory. Thus, in practical travel-routing systems, it is generally outperformed by algorithms that can pre-process the graph to attain better performance, as well as memory-bounded approaches; however, A* is still the best solution in many cases.

Link-state routing protocols are one of the two main classes of routing protocols used in packet switching networks for computer communications, the others being distance-vector routing protocols. Examples of link-state routing protocols include Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (IS-IS).

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). It is one of several forms of causal notation. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

<span class="mw-page-title-main">Directed acyclic graph</span> Directed graph with no directed cycles

In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges, with each edge directed from one vertex to another, such that following those directions will never form a closed loop. A directed graph is a DAG if and only if it can be topologically ordered, by arranging the vertices as a linear ordering that is consistent with all edge directions. DAGs have numerous scientific and computational applications, ranging from biology to information science to computation (scheduling).

In computer science, a skip list is a probabilistic data structure that allows average complexity for search as well as average complexity for insertion within an ordered sequence of elements. Thus it can get the best features of a sorted array while maintaining a linked list-like structure that allows insertion, which is not possible with a static array. Fast search is made possible by maintaining a linked hierarchy of subsequences, with each successive subsequence skipping over fewer elements than the previous one. Searching starts in the sparsest subsequence until two consecutive elements have been found, one smaller and one larger than or equal to the element searched for. Via the linked hierarchy, these two elements link to elements of the next sparsest subsequence, where searching is continued until finally searching in the full sequence. The elements that are skipped over may be chosen probabilistically or deterministically, with the former being more common.

<span class="mw-page-title-main">Onion routing</span> Technique for anonymous communication over a computer network

Onion routing is a technique for anonymous communication over a computer network. In an onion network, messages are encapsulated in layers of encryption, analogous to the layers of an onion. The encrypted data is transmitted through a series of network nodes called "onion routers," each of which "peels" away a single layer, revealing the data's next destination. When the final layer is decrypted, the message arrives at its destination. The sender remains anonymous because each intermediary knows only the location of the immediately preceding and following nodes. While onion routing provides a high level of security and anonymity, there are methods to break the anonymity of this technique, such as timing analysis.

<span class="mw-page-title-main">Network theory</span> Study of graphs as a representation of relations between discrete objects

In mathematics, computer science and network science, network theory is a part of graph theory. It defines networks as graphs where the nodes or edges possess attributes. Network theory analyses these networks over the symmetric relations or asymmetric relations between their (discrete) components.

In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. Precisely, a topological sort is a graph traversal in which each node v is visited only after all its dependencies are visited. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time. Topological sorting has many applications especially in ranking problems such as feedback arc set. Topological sorting is possible even when the DAG has disconnected components.

The Temporally Ordered Routing Algorithm (TORA) is an algorithm for routing data across Wireless Mesh Networks or Mobile ad hoc networks.

<span class="mw-page-title-main">Centrality</span> Degree of connectedness within a graph

In graph theory and network analysis, indicators of centrality assign numbers or rankings to nodes within a graph corresponding to their network position. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, super-spreaders of disease, and brain networks. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

Pipeline forwarding (PF) applies to packet forwarding in computer networks the basic concept of pipelining, which has been widely and successfully used in computing — specifically, in the architecture of all major central processing units (CPUs) — and manufacturing — specifically in assembly lines of various industries starting from automotive to many others. Pipelining is known to be optimal independent of the specific instantiation. In particular, PF is optimal from various points of view:

  1. High efficiency in utilization of network resources, which enables accommodating a larger amount of traffic on the network, thus lowering operation cost and being the foundation for accommodating the exponential growth of modern networks.
  2. Low implementation complexity, which enables the realization of larger and more powerful networking systems at low cost, thus offering further support to network growth.
  3. High scalability, which is an immediate consequence of the above two features.
  4. Deterministic and predictable operation with minimum delay and no packet loss even under full load condition, which is key in supporting the demanding requirements of the new and valuable services that are being deployed, or envisioned to be deployed, on modern networks, such as telephony, videoconferencing, virtual presence, video on demand, distributed gaming.
<span class="mw-page-title-main">Citation graph</span> Directed graph describing citations in documents

A citation graph, in information science and bibliometrics, is a directed graph that describes the citations within a collection of documents.

Segment protection is a type of backup technique that can be used in most networks. It can be implemented as a dedicated backup or as a shared backup protection. Overlapping segments and non-overlapping segments are allowed; each providing different advantages.

<span class="mw-page-title-main">Blockmodeling</span> Analytical method for social structure

Blockmodeling is a set or a coherent framework, that is used for analyzing social structure and also for setting procedure(s) for partitioning (clustering) social network's units, based on specific patterns, which form a distinctive structure through interconnectivity. It is primarily used in statistics, machine learning and network science.

Patrick Doreian is an American mathematician and social scientist, whose specialty is network analysis. His specific research interests include blockmodeling, social structure and network processes.

References

  1. 1 2 3 4 5 6 7 Hummon, Norman P.; Doreian, Patrick (1989). "Connectivity in a citation network: The development of DNA theory". Social Networks. 11 (1): 39–63. doi:10.1016/0378-8733(89)90017-8.
  2. 1 2 Liu, John S.; Lu, Louis Y.Y.; Lu, Wen-Min; Lin, Bruce J.Y. (2013). "Data envelopment analysis 1978–2010: A citation-based literature survey". Omega. 41 (1): 3–15. doi:10.1016/j.omega.2010.12.006.
  3. 1 2 Verspagen, Bart (2007-03-01). "Mapping technological trajectories as patent citation networks: a study on the history of fuel cell research". Advances in Complex Systems. 10 (1): 93–115. doi:10.1142/S0219525907000945. ISSN   0219-5259.
  4. 1 2 3 Lucio-Arias, Diana; Leydesdorff, Loet (2008-10-01). "Main-path analysis and path-dependent transitions in HistCite™-based historiograms". Journal of the American Society for Information Science and Technology. 59 (12): 1948–1962. doi:10.1002/asi.20903. ISSN   1532-2890.
  5. 1 2 Liu, John S.; Chen, Hsiao-Hui; Ho, Mei Hsiu-Ching; Li, Yu-Chen (2014-12-01). "Citations with different levels of relevancy: Tracing the main paths of legal opinions". Journal of the Association for Information Science and Technology. 65 (12): 2479–2488. doi:10.1002/asi.23135. ISSN   2330-1643.
  6. Liu, John S.; Lu, Louis Y. Y.; Ho, Mei Hsiu-Ching (2019-04-01). "A few notes on main path analysis". Scientometrics. 119 (1): 379–391. doi: 10.1007/s11192-019-03034-x . ISSN   1588-2861.
  7. 1 2 Liu, John S.; Lu, Louis Y.Y. (2012-03-01). "An integrated approach for main path analysis: Development of the Hirsch index as an example". Journal of the American Society for Information Science and Technology. 63 (3): 528–542. doi:10.1002/asi.21692. ISSN   1532-2890.
  8. 1 2 Liu, John S.; Kuan, Chung-Huei (2016-02-01). "A new approach for main path analysis: Decay in knowledge diffusion". Journal of the Association for Information Science and Technology. 67 (2): 465–476. doi:10.1002/asi.23384. ISSN   2330-1643.
  9. Batagelj, V. (2003). Efficient algorithms for citation network analysis. arXiv preprint cs/0309023.
  10. Yeo, Woondong; Kim, Seonho; Lee, Jae-Min; Kang, Jaewoo (2014-01-01). "Aggregative and stochastic model of main path identification: a case study on graphene". Scientometrics. 98 (1): 633–655. doi:10.1007/s11192-013-1140-3. ISSN   0138-9130.
  11. Hummon, Norman P.; Doreian, Patrick; Freeman, Linton C. (2016-08-18). "Analyzing the Structure of the Centrality-Productivity Literature Created Between 1948 and 1979". Knowledge. 11 (4): 459–480. doi:10.1177/107554709001100405.
  12. Carley, Kathleen M.; Hummon, Norman P.; Harty, Martha (2016-08-17). "Scientific Influence". Knowledge. 14 (4): 417–447. doi:10.1177/107554709301400406.
  13. Liu, John S.; Lu, Louis Y.Y.; Lu, Wen-Min (2016). "Research fronts in data envelopment analysis". Omega. 58: 33–45. doi:10.1016/j.omega.2015.04.004.
  14. Liu, John S.; Lu, Louis Y.Y.; Lu, Wen-Min; Lin, Bruce J.Y. (2013). "A survey of DEA applications". Omega. 41 (5): 893–902. doi:10.1016/j.omega.2012.11.004.
  15. Claudia Colicchia; Fernanda Strozzi (2012-06-15). "Supply chain risk management: a new methodology for a systematic literature review". Supply Chain Management. 17 (4): 403–418. doi:10.1108/13598541211246558. ISSN   1359-8546.
  16. Lu, Louis Y.Y.; Liu, John S. (2014-03-01). "The Knowledge Diffusion Paths of Corporate Social Responsibility – From 1970 to 2011". Corporate Social Responsibility and Environmental Management. 21 (2): 113–128. doi:10.1002/csr.1309. ISSN   1535-3966.
  17. Liang, Huigang; Wang, Jian-Jun; Xue, Yajiong; Cui, Xiaocong (2016). "IT outsourcing research from 1992 to 2013: A literature review based on main path analysis". Information & Management. 53 (2): 227–251. doi:10.1016/j.im.2015.10.001.
  18. Chuang, Thomas C.; Liu, John S.; Lu, Louis Y.Y.; Lee, Yachi (2014). "The main paths of medical tourism: From transplantation to beautification". Tourism Management. 45: 49–58. doi:10.1016/j.tourman.2014.03.016.
  19. Mina, A.; Ramlogan, R.; Tampubolon, G.; Metcalfe, J.S. (2007). "Mapping evolutionary trajectories: Applications to the growth and transformation of medical knowledge". Research Policy. 36 (5): 789–806. doi:10.1016/j.respol.2006.12.007.