Citation graph

In this example, document b cites document d, and is cited by document a.

A citation graph (or citation network), in information science and bibliometrics, is a directed graph that describes the citations within a collection of documents.

Each vertex (or node) in the graph represents a document in the collection, and each edge is directed from one document toward another that it cites (or vice versa depending on the specific implementation). [1]
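A minimal sketch of this structure in Python, assuming the citing-to-cited edge direction and reusing the document labels a, b and d from the example figure. The `citations` mapping and the `cited_by` helper are illustrative names, not part of any standard library.

```python
# Hypothetical toy collection: each key is a document, each value is the
# set of documents it cites (edges point from citing to cited document).
citations = {
    "a": {"b"},
    "b": {"d"},
    "d": set(),
}


def cited_by(graph, doc):
    """Return the documents that cite `doc` (the reversed edge direction)."""
    return {src for src, targets in graph.items() if doc in targets}


print(sorted(citations["b"]))            # documents that b cites
print(sorted(cited_by(citations, "b")))  # documents that cite b
```

Storing only the outgoing edges keeps the structure small; an implementation that needs frequent "cited by" lookups would more likely keep a second, reversed mapping.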

Citation graphs have been utilised in various ways, including forms of citation analysis, academic search tools and court judgements. They are predicted to become more relevant and useful in the future as the body of published research grows.

Implementation

There is no standard format for the citations in bibliographies, and the record linkage of citations can be a time-consuming and complicated process. Furthermore, citation errors can occur at any stage of the publishing process. However, citation databases, also known as citation indexes, have a long history, so these problems are well documented.

In principle, each document should have a unique publication date and can only refer to earlier documents. This means that an ideal citation graph is not only directed but acyclic; that is, there are no loops in the graph. This is not always the case in practice, since an academic paper goes through several versions in the publishing process. The timing of asynchronous updates to bibliographies may lead to edges that apparently point backward in time. Such "backward" citations seem to constitute less than 1% of the total number of links. [2]
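The two checks described above can be sketched as follows, assuming each document carries a single publication year. The sample data, the `year` mapping and both function names are hypothetical. Acyclicity is tested with Kahn's topological-sort algorithm, one standard choice among several.

```python
from collections import deque


def backward_edges(citations, year):
    """List edges whose cited document is dated later than the citing one."""
    return [
        (src, dst)
        for src, targets in citations.items()
        for dst in sorted(targets)
        if year[dst] > year[src]
    ]


def is_acyclic(citations):
    """Kahn's algorithm: the graph is acyclic if every node can be removed."""
    nodes = set(citations) | {d for ts in citations.values() for d in ts}
    indegree = {node: 0 for node in nodes}
    for targets in citations.values():
        for dst in targets:
            indegree[dst] += 1
    queue = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for dst in citations.get(node, ()):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                queue.append(dst)
    return removed == len(nodes)


# Hypothetical data: p2 was revised after p3 appeared and now cites it.
year = {"p1": 2001, "p2": 2003, "p3": 2004}
citations = {"p1": set(), "p2": {"p1", "p3"}, "p3": {"p2"}}

print(backward_edges(citations, year))  # the edge pointing forward in time
print(is_acyclic(citations))            # False: p2 and p3 cite each other
```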

As citation links are meant to be permanent, the bulk of a citation graph should be static, and only the leading edge of the graph should change. Exceptions might occur when papers are withdrawn from circulation. [2]

Background and history

A citation is a reference to a published or unpublished source (not always the original source). More precisely, a citation is an abbreviated alphanumeric expression embedded in the body of an intellectual work that denotes an entry in the bibliographic references section of the work. Its purpose is to acknowledge the relevance of the works of others to the topic of discussion at the point where the citation appears.

Generally the combination of both the in-body citation and the bibliographic entry constitutes what is commonly thought of as a citation (whereas bibliographic entries by themselves are not). [3] References to single, machine-readable assertions in electronic scientific articles are known as nanopublications, a form of microattribution.

Citation networks are one kind of social network that has been studied quantitatively almost from the moment citation databases first became available. In 1965, Derek J. de Solla Price described the inherent linking characteristic of the Science Citation Index (SCI) in his paper entitled "Networks of Scientific Papers." The links between citing and cited papers became dynamic when the SCI began to be published online. In 1973, Henry Small published his work on co-citation analysis, which became a self-organizing classification system that led to document clustering experiments and eventually what is called "Research Reviews." [4]

Applications

Citation Analysis

Citation graphs can be used to measure scholarly impact, that is, the influence a particular paper has had on the academic world. Although scholarly impact is hard to quantify, an estimate of it across many papers helps to identify important ones, and can also indicate the relevance of a particular academic community. Citation graphs suit this purpose because the number of incoming connections a paper has in the graph shows how many other papers have cited it. [5]
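As a rough illustration, the simplest such measure is the citation count, which is the in-degree of a node when edges point from citing to cited document. The sketch below uses hypothetical data and a hypothetical function name. A raw count is only a crude proxy, and weighted measures also exist.

```python
from collections import Counter


def citation_counts(citations):
    """Count how many documents in the collection cite each document."""
    counts = Counter({doc: 0 for doc in citations})
    for targets in citations.values():
        counts.update(targets)  # each cited document gains one citation
    return counts


# Hypothetical collection: keys cite the documents in their value sets.
citations = {"a": {"b", "d"}, "b": {"d"}, "c": {"d"}, "d": set()}

counts = citation_counts(citations)
print(counts.most_common(1))  # the most-cited document and its count
```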

Similarity analysis is another area of citation analysis that frequently makes use of citation graphs. The relationship between two papers in the citation graph has been compared with their text-based similarity, and closeness in the citation graph has been found to predict a level of text-based similarity. [6] Additionally, the two methods – citation graph closeness and traditional content-based similarity – have been found to work well in conjunction, producing a more accurate result. [6]
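One simple way to express "closeness" is the number of citation links separating two papers when edge direction is ignored. The sketch below computes it with a breadth-first search. This is an illustrative assumption; the cited study may define closeness differently, and `hop_distance` and the sample data are hypothetical.

```python
from collections import deque


def hop_distance(citations, start, goal):
    """Fewest citation links between two documents, ignoring direction.

    Returns None when the documents are not connected at all.
    """
    neighbours = {}
    for src, targets in citations.items():
        neighbours.setdefault(src, set())
        for dst in targets:
            neighbours[src].add(dst)
            neighbours.setdefault(dst, set()).add(src)
    if start not in neighbours or goal not in neighbours:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical collection; document e is not linked to anything.
citations = {"a": {"b"}, "b": {"d"}, "c": {"d"}, "d": set(), "e": set()}

print(hop_distance(citations, "a", "d"))
print(hop_distance(citations, "a", "e"))
```

A smaller distance could then be combined with a text-based similarity score, in the spirit of the combined approach described above.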

Analyses of citation graphs have also led to the proposal of the citation graph as a way to identify different communities and research areas within the academic world. Analysing the citation graph for groups of documents in conjunction with keywords has been found to provide an accurate way to identify clusters of similar research. [7] In a similar vein, the main “stream” of an area of research, or the progression of a research idea over time, can be identified by running depth-first search algorithms on the citation graph. Rather than comparing two nodes, or clusters of many nodes, this method follows the links between nodes to trace a research idea back to its beginning, and so reveals its progression through different papers up to its current state. [8]
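The tracing step can be sketched with a plain depth-first search that follows citation links from a recent paper back to papers that cite nothing further. This is a simplified illustration and not the full method of the cited work, which additionally weights links to select a main path. The function name and data are hypothetical.

```python
def trace_to_origins(citations, start):
    """Depth-first search along citation links.

    Returns every path from `start` to a document that cites nothing
    further in the collection. Documents already on the current path are
    skipped, so an accidental cycle cannot cause infinite recursion.
    """
    paths = []

    def visit(doc, path):
        targets = [t for t in sorted(citations.get(doc, ())) if t not in path]
        if not targets:
            paths.append(path)
            return
        for target in targets:
            visit(target, path + [target])

    visit(start, [start])
    return paths


# Hypothetical collection: e is the most recent paper, a and b the oldest.
citations = {
    "a": set(),
    "b": set(),
    "c": {"a"},
    "d": {"a", "b"},
    "e": {"c", "d"},
}

for path in trace_to_origins(citations, "e"):
    print(" -> ".join(path))
```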

Search Tools

The traditional method used by academic search tools is to check for matches between a search term and keywords in papers to return potential matches. While mostly effective, this method can lead to errors where a paper is recommended from a different discipline because of keyword matches even when the two topics actually have little in common.

Many have argued that this way of searching for relevant papers could be improved and made more accurate if citation graphs were incorporated into academic paper search tools. For example, one system was proposed which used both the keyword system and a popularity system based on how many connections a paper had in the citation graph. In this system, more connected papers were considered more popular and therefore given a higher weighting in the paper recommendation system. [9]
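A minimal sketch of such a combined ranking follows, assuming a keyword-overlap score plus a popularity bonus based on each paper's number of connections. The scoring formula, the `popularity_weight` parameter and the sample data are all hypothetical and are not taken from the cited system.

```python
def rank_papers(query_terms, keywords, citations, popularity_weight=0.5):
    """Rank papers by keyword overlap plus a popularity bonus.

    Popularity is the paper's degree (citations made plus citations
    received), scaled by the largest degree in the collection.
    """
    degree = {paper: 0 for paper in keywords}
    for src, targets in citations.items():
        for dst in targets:
            degree[src] = degree.get(src, 0) + 1
            degree[dst] = degree.get(dst, 0) + 1
    max_degree = max(degree.values(), default=0) or 1

    query = set(query_terms)
    scored = []
    for paper, terms in keywords.items():
        overlap = len(query & terms) / len(query) if query else 0.0
        if overlap == 0:
            continue  # papers with no keyword match are never returned
        popularity = degree[paper] / max_degree
        scored.append((overlap + popularity_weight * popularity, paper))
    # Highest score first; ties are broken alphabetically by paper id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [paper for _, paper in scored]


# Hypothetical keyword sets and citation links.
keywords = {
    "p1": {"graph", "citation"},
    "p2": {"graph", "citation"},
    "p3": {"graph", "protein"},
    "p4": {"network"},
}
citations = {"p2": {"p1"}, "p3": {"p1"}, "p4": {"p1", "p3"}}

print(rank_papers(["citation", "graph"], keywords, citations))
```

Here p1 and p2 match the query equally well, and the popularity term is what places the more heavily connected p1 first.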

In more recent years, visual search tools have been developed which use citation graphs to provide a visual representation of the connections between papers. A notable pioneer in this concept is the search tool Connected Papers, which began as a small project between friends and was released to the public in 2020. Given one academic paper, it analyses tens of thousands of other papers, selects those relevant to the origin paper, builds a citation graph from them, and returns a visual representation of it to the viewer. This view lets the viewer see an area of research at a glance, which can aid in understanding the state of that area and in quickly identifying key papers that have many connections.[citation needed]

Court Judgements

Citation graphs have a history of being used to aid in organising and mapping citations of legal documents. As with the search tools described above, citation graphs built specifically for the types of citations found in legal documents have been used to find relevant past legal documents when they are needed for a court decision. As a replacement for, or an improvement upon, traditional search methods, this citation-graph-aided approach to organising legal documents can provide higher efficiency, accuracy, and organisation. [10]

There are several other types of network graphs that are closely related to citation networks. The co-citation graph takes documents as nodes and connects two documents if another document cites both of them; bibliographic coupling, conversely, connects two documents that cite a common work (see Co-citation and Bibliographic coupling). Other related networks are formed using other information present in the document. For instance, in a collaboration graph, known in this context as a co-authorship network, the nodes are the authors of documents, linked if they have co-authored the same document. The link weights between two authors in co-authorship networks can increase over time if they collaborate further.
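Both document-to-document networks can be derived directly from the citation graph. The sketch below assumes the usual definitions: two documents are co-cited when a third document cites both, and bibliographically coupled when they share a reference. The function names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(citations):
    """For each pair of documents, count how many documents cite both."""
    counts = Counter()
    for targets in citations.values():
        for pair in combinations(sorted(targets), 2):
            counts[pair] += 1
    return counts


def coupling_counts(citations):
    """For each pair of documents, count the references they share."""
    counts = Counter()
    for doc1, doc2 in combinations(sorted(citations), 2):
        shared = len(citations[doc1] & citations[doc2])
        if shared:
            counts[(doc1, doc2)] = shared
    return counts


# Hypothetical collection: keys cite the documents in their value sets.
citations = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "e": {"c"},
    "c": set(),
    "d": set(),
}

print(dict(co_citation_counts(citations)))  # c and d are cited together
print(dict(coupling_counts(citations)))     # a, b and e share references
```

The resulting counts can serve as edge weights in the derived graph. A co-authorship network could be built in the same way from author lists instead of reference lists.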

Future Developments

While citation graphs have had a noticeable impact on several areas of academia, they are likely to become more relevant in the future. As the body of published research grows, more traditional ways of searching for papers will become less effective at narrowing results down to the papers relevant to a particular topic. For example, text-based similarity can only go so far in selecting which papers are relevant to a topic, whereas adding citation graphs would allow higher priority to be given to papers that have many connections to other papers relevant to the topic.

However, developments like this face the same challenge as most applications of citation graphs: there is no standardized format or way of citing. This makes the construction of these graphs very difficult, since it requires complex software analysis to extract citations from papers. One proposed solution to this problem is to create open databases of citation information in a format that anyone could use and easily convert to a different form, for example a citation graph. [11]

References

  1. Egghe, Leo; Rousseau, Ronald (1990). Introduction to Informetrics: quantitative methods in library, documentation and information science. Amsterdam, the Netherlands: Elsevier Science Publishers. p. 228. ISBN 0-444-88493-9.
  2. Clough, James R.; Gollings, Jamie; Loach, Tamar V.; Evans, Tim S. (2015). "Transitive reduction of citation networks". Journal of Complex Networks. 3 (2): 189–203. arXiv:1310.8224. doi:10.1093/comnet/cnu039. S2CID 10228152.
  3. Zhao, Dangzhi; Strotmann, Andreas (2015-02-01). Analysis and Visualization of Citation Networks. Morgan & Claypool Publishers. ISBN 978-1-60845-939-1.
  4. Kas, Miray. Structures and Statistics of Citation Networks.
  5. Życzkowski, Karol (2010-10-01). "Citation graph, weighted impact factors and performance indices". Scientometrics. 85 (1): 301–315. arXiv:0904.2110. doi:10.1007/s11192-010-0208-6. ISSN 1588-2861. S2CID 7614954.
  6. Lu, Wangzhong; Janssen, J.; Milios, E.; Japkowicz, N.; Zhang, Yongzheng (2007-01-01). "Node similarity in the citation graph". Knowledge and Information Systems. 11 (1): 105–129. doi:10.1007/s10115-006-0023-9. ISSN 0219-3116. S2CID 26234247.
  7. Bolelli, Levent; Ertekin, Seyda; Giles, C. Lee (2006). "Clustering Scientific Literature Using Sparse Citation Graph Analysis". In Fürnkranz, Johannes; Scheffer, Tobias; Spiliopoulou, Myra (eds.). Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science. Vol. 4213. Berlin, Heidelberg: Springer. pp. 30–41. doi:10.1007/11871637_8. ISBN 978-3-540-46048-0. S2CID 15527080.
  8. Hummon, Norman P.; Doreian, Patrick (1989-03-01). "Connectivity in a citation network: The development of DNA theory". Social Networks. 11 (1): 39–63. doi:10.1016/0378-8733(89)90017-8. ISSN 0378-8733.
  9. Liu, Hanwen; Kou, Huaizhen; Yan, Chao; Qi, Lianyong (2020-04-24). "Keywords-Driven and Popularity-Aware Paper Recommendation Based on Undirected Paper Citation Graph". Complexity. 2020: e2085638. doi:10.1155/2020/2085638. ISSN 1076-2787.
  10. Sadeghian, Ali; Sundaram, Laksshman; Wang, Daisy Zhe; Hamilton, William F.; Branting, Karl; Pfeifer, Craig (June 2018). "Automatic semantic edge labeling over legal citation graphs". Artificial Intelligence and Law. 26 (2): 127–144. doi:10.1007/s10506-018-9217-1. ISSN 0924-8463. S2CID 254266762.
  11. Lauscher, Anne; Eckert, Kai; Galke, Lukas; Scherp, Ansgar; Rizvi, Syed Tahseen Raza; Ahmed, Sheraz; Dengel, Andreas; Zumstein, Philipp; Klein, Annette (2018-05-23). "Linked Open Citation Database". Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries. JCDL '18. New York, NY, USA: Association for Computing Machinery. pp. 109–118. doi:10.1145/3197026.3197050. ISBN 978-1-4503-5178-2. S2CID 4902279.
