Citation graph

In this example, document b cites document d, and is cited by document a.

A citation graph (or citation network), in information science and bibliometrics, is a directed graph that describes the citations within a collection of documents.

Each vertex (or node) in the graph represents a document in the collection, and each edge is directed from one document toward another that it cites (or vice versa depending on the specific implementation). [1]
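As a minimal sketch of this structure, the citing-to-cited convention can be held in an adjacency mapping, and the opposite convention obtained by reversing every edge. The labels a, b and d follow the caption example; the names `cites` and `cited_by` are illustrative assumptions, not part of any standard.

```python
# Edges point from the citing document to the cited one.
# Labels follow the caption example: a cites b, and b cites d.
cites = {
    "a": {"b"},
    "b": {"d"},
    "d": set(),
}

def cited_by(cites):
    """Reverse every edge to get the cited-to-citing view of the same graph."""
    reverse = {doc: set() for doc in cites}
    for citing, cited_docs in cites.items():
        for cited in cited_docs:
            reverse.setdefault(cited, set()).add(citing)
    return reverse

incoming = cited_by(cites)
print(incoming["b"])  # {'a'}
```

Both mappings describe the same graph; which one an implementation stores usually depends on whether it looks up references or incoming citations more often.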

Citation graphs have been utilised in various ways, including forms of citation analysis, academic search tools and court judgements. They are predicted to become more relevant and useful in the future as the body of published research grows.

Implementation

There is no standard format for the citations in bibliographies, and the record linkage of citations can be a time-consuming and complicated process. Furthermore, citation errors can occur at any stage of the publishing process. However, citation databases, also known as citation indexes, have a long history, so these problems are well documented.

In principle, each document should have a unique publication date and can only refer to earlier documents. This means that an ideal citation graph is not only directed but acyclic; that is, there are no loops in the graph. This is not always the case in practice, since an academic paper goes through several versions in the publishing process. The timing of asynchronous updates to bibliographies may lead to edges that apparently point backward in time. Such "backward" citations seem to constitute less than 1% of the total number of links. [2]
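Both properties described here can be checked mechanically. The sketch below assumes a mapping `cites` from each document to the documents it cites and a hypothetical `published` mapping of publication years; it tests acyclicity with Kahn's algorithm and lists edges that point forward in time. The data is invented for illustration.

```python
def is_acyclic(cites):
    """Kahn's algorithm: repeatedly remove documents that no remaining document cites."""
    nodes = set(cites)
    for targets in cites.values():
        nodes.update(targets)
    in_degree = {n: 0 for n in nodes}
    for targets in cites.values():
        for t in targets:
            in_degree[t] += 1
    ready = [n for n in nodes if in_degree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for t in cites.get(n, ()):
            in_degree[t] -= 1
            if in_degree[t] == 0:
                ready.append(t)
    # Any document left over sits on a cycle or is only reachable through one.
    return removed == len(nodes)

def backward_edges(cites, published):
    """Edges whose cited document is dated later than the citing one."""
    return [
        (src, dst)
        for src, targets in cites.items()
        for dst in targets
        if published[dst] > published[src]
    ]

# Hypothetical data: an ideal graph, then one with a mutual citation.
ideal = {"a": {"b"}, "b": {"d"}, "d": set()}
ideal_years = {"a": 2015, "b": 2010, "d": 2005}
mutual = {"a": {"b"}, "b": {"a"}}
mutual_years = {"a": 2015, "b": 2016}

print(is_acyclic(ideal), backward_edges(ideal, ideal_years))
print(is_acyclic(mutual), backward_edges(mutual, mutual_years))
```

Comparing dates only flags suspicious edges; whether a flagged edge is an error or a legitimate reference to a later version is a separate judgement.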

As citation links are meant to be permanent, the bulk of a citation graph should be static, and only the leading edge of the graph should change. Exceptions might occur when papers are withdrawn from circulation. [2]

Background and history

A citation is a reference to a published or unpublished source (not always the original source). More precisely, a citation is an abbreviated alphanumeric expression embedded in the body of an intellectual work that denotes an entry in the bibliographic references section of the work. Its purpose is to acknowledge the relevance of the works of others to the topic of discussion at the point where the citation appears.

Generally the combination of both the in-body citation and the bibliographic entry constitutes what is commonly thought of as a citation (whereas bibliographic entries by themselves are not). [3] References to single, machine-readable assertions in electronic scientific articles are known as nanopublications, a form of micro attributions.

Citation networks are one kind of social network that has been studied quantitatively almost from the moment citation databases first became available. In 1965, Derek J. de Solla Price described the inherent linking characteristic of the Science Citation Index (SCI) in his paper entitled "Networks of Scientific Papers." The links between citing and cited papers became dynamic when the SCI began to be published online. In 1973, Henry Small published his work on co-citation analysis, which became a self-organizing classification system that led to document clustering experiments and eventually what is called "Research Reviews." [4]

Applications

Citation Analysis

Citation graphs can be used to measure scholarly impact, the influence a particular paper has had on the academic world. Although scholarly impact is hard to quantify, a measure of it across many papers helps to identify important papers, and it can also indicate the relevance of a particular academic community. Citation graphs suit this task because the number of incoming connections a document has in the graph is the number of other papers that cite it, which corresponds with the article's scholarly impact. [5]
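The simplest such measure is a raw citation count, which is the in-degree of each document. The sketch below assumes a mapping `cites` from each document to the documents it cites; the data is invented, and real impact measures such as those discussed in [5] involve more than a plain count.

```python
from collections import Counter

def citation_counts(cites):
    """In-degree of every document: how many documents in the collection cite it."""
    counts = Counter({doc: 0 for doc in cites})
    for cited_docs in cites.values():
        for cited in cited_docs:
            counts[cited] += 1
    return counts

# Hypothetical collection in which d is cited by a, b and c.
cites = {"a": {"b", "d"}, "b": {"d"}, "c": {"d"}, "d": set()}
counts = citation_counts(cites)
print(counts.most_common(1))  # [('d', 3)]
```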

Similarity analysis is another area of citation analysis that frequently makes use of citation graphs. The relationship between two papers in the citation graph has been compared to their text-based similarity, and closeness in the citation graph was found to predict a level of text-based similarity. [6] Additionally, the two methods, citation graph closeness and traditional content-based similarity, were found to work well in conjunction to produce a more accurate result. [6]

Analyses of citation graphs have also led to proposals to use the citation graph to identify different communities and research areas within the academic world. Analysing the citation graph for groups of documents in conjunction with keywords has been found to provide an accurate way to identify clusters of similar research. [7] In a similar vein, the main "stream" of an area of research, or the progression of a research idea over time, can be identified by running depth-first search algorithms on the citation graph. Instead of looking at the similarity between two nodes, or between clusters of many nodes, this method follows the links between nodes to trace a research idea back to its beginning, and so reveals its progression through different papers to its current state. [8]
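The tracing idea can be sketched with a plain depth-first search that follows citations from a recent paper back to documents that cite nothing further. This is a simplification for illustration, not a reproduction of any published method; the function name `citation_chains` and the sample data are assumptions.

```python
def citation_chains(cites, start):
    """Depth-first search from `start`, yielding each chain of citations that
    ends at a document with no further uncited-so-far references."""
    stack = [(start, [start])]
    while stack:
        doc, path = stack.pop()
        # Skip documents already on this chain so that cycles cannot loop forever.
        earlier = [d for d in sorted(cites.get(doc, ())) if d not in path]
        if not earlier:
            yield path
        for d in earlier:
            stack.append((d, path + [d]))

# Hypothetical collection: e builds on c and d, which both build on a.
cites = {"e": {"c", "d"}, "c": {"a"}, "d": {"a"}, "a": set()}
chains = sorted(citation_chains(cites, "e"))
print(chains)  # [['e', 'c', 'a'], ['e', 'd', 'a']]
```

Enumerating every chain grows quickly on large graphs, so a practical tool would need some way to rank or prune chains rather than list them all.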

Search Tools

The traditional method used by academic search tools is to check for matches between a search term and keywords in papers to return potential matches. While mostly effective, this method can lead to errors where a paper is recommended from a different discipline because of keyword matches even when the two topics actually have little in common.

Many have argued that this way of searching for relevant papers could be improved and made more accurate if citation graphs were incorporated into academic paper search tools. For example, one system was proposed which used both the keyword system and a popularity system based on how many connections a paper had in the citation graph. In this system, more connected papers were considered more popular and therefore given a higher weighting in the paper recommendation system. [9]
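A combined ranking of this kind might look like the sketch below. It is not the system proposed in [9]: the scoring formula, the `popularity_weight` parameter, and the sample papers are all assumptions chosen to show how a keyword match and a connection count could be blended.

```python
from collections import Counter

def recommend(papers, cites, query_terms, popularity_weight=0.5):
    """Rank papers by keyword overlap plus a bonus for being well connected.

    `papers` maps a paper id to its keywords; `cites` maps a paper id to the
    ids it cites. Connections are counted in both directions, treating the
    citation graph as undirected.
    """
    query = set(query_terms)
    if not query:
        return []
    degree = Counter()
    for citing, cited_docs in cites.items():
        for cited in cited_docs:
            degree[citing] += 1
            degree[cited] += 1
    max_degree = max(degree.values(), default=0) or 1
    scores = {}
    for paper, keywords in papers.items():
        keyword_score = len(query & set(keywords)) / len(query)
        if keyword_score == 0:
            continue  # popularity alone never makes a paper relevant
        scores[paper] = keyword_score + popularity_weight * degree[paper] / max_degree
    return sorted(scores, key=lambda p: (-scores[p], p))

# Hypothetical data: p3 matches only one keyword but is the most cited.
papers = {
    "p1": ["graph", "citation"],
    "p2": ["graph", "protein"],
    "p3": ["citation", "ranking"],
}
cites = {"p1": {"p3"}, "p2": {"p3"}}
ranking = recommend(papers, cites, ["citation", "graph"])
print(ranking)  # ['p1', 'p3', 'p2']
```

In this example p2 and p3 match the query equally well on keywords, and the popularity term is what moves p3 ahead.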

In more recent years, visual search tools have been developed which use citation graphs to provide a visual representation of the connections between papers. A commercial implementation of this concept is the search tool Connected Papers. [citation needed]

Court Judgements

Citation graphs have a history of being used to organise and map the citations of legal documents. As with the search tools described above, citation graphs built for the specific types of citation found in legal documents have been used to find relevant past legal documents when they are needed for a court decision. As a replacement for, or an improvement on, traditional search methods, organising legal documents with the aid of a citation graph can provide higher efficiency, accuracy, and organisation. [10]

There are several other types of network graphs that are closely related to citation networks. The co-citation graph has documents as nodes, with two documents connected if a third document cites both of them; the related bibliographic coupling graph instead connects two documents that cite a common third document (see Co-citation and Bibliographic coupling). Other related networks are formed using other information present in the document. For instance, in a collaboration graph, known in this context as a co-authorship network, the nodes are the authors of documents, and two authors are linked if they have co-authored the same document. The link weight between two authors in a co-authorship network can increase over time if they collaborate further.
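Both document-level derived networks can be computed directly from the citation graph. The sketch below assumes the usual definitions: two documents are co-cited when a third document cites both, and two documents are bibliographically coupled when they share a reference. The mapping `cites` and the sample data are illustrative.

```python
from collections import Counter
from itertools import combinations

def co_citation(cites):
    """For each pair of documents, count how many documents cite both."""
    pairs = Counter()
    for cited_docs in cites.values():
        for x, y in combinations(sorted(cited_docs), 2):
            pairs[(x, y)] += 1
    return pairs

def bibliographic_coupling(cites):
    """For each pair of documents, count how many references they share."""
    pairs = Counter()
    for x, y in combinations(sorted(cites), 2):
        shared = len(set(cites[x]) & set(cites[y]))
        if shared:
            pairs[(x, y)] = shared
    return pairs

# Hypothetical collection: a and b both cite c and d.
cites = {"a": {"c", "d"}, "b": {"c", "d"}, "c": set(), "d": set()}
co_cited = co_citation(cites)
coupled = bibliographic_coupling(cites)
print(co_cited[("c", "d")], coupled[("a", "b")])  # 2 2
```

The counts serve as edge weights: c and d are co-cited twice, and a and b share two references.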

Future Developments

While citation graphs have had a noticeable impact on several areas of academia, they are likely to become more relevant in the future. As the body of published research grows, more traditional ways of searching for papers will become less effective at narrowing results down to the papers relevant to a particular topic. For example, text-based similarity can only go so far in selecting which papers are relevant to a topic, whereas adding a citation graph would allow higher priority to be given to papers that have many connections to other papers relevant to the topic.

However, developments like this face the same challenge as most applications of citation graphs: there is no standardized format or way of citing. This makes the construction of these graphs very difficult, since it requires complex software analysis to extract citations from papers. One proposed solution to this problem is to create open databases of citation information in a format that anyone could use and easily convert to a different form, for example a citation graph. [11]

References

  1. Egghe, Leo; Rousseau, Ronald (1990). Introduction to Informetrics: quantitative methods in library, documentation and information science. Amsterdam, the Netherlands: Elsevier Science Publishers. p. 228. ISBN 0-444-88493-9.
  2. Clough, James R.; Gollings, Jamie; Loach, Tamar V.; Evans, Tim S. (2015). "Transitive reduction of citation networks". Journal of Complex Networks. 3 (2): 189–203. arXiv:1310.8224. doi:10.1093/comnet/cnu039. S2CID 10228152.
  3. Zhao, Dangzhi; Strotmann, Andreas (2015-02-01). Analysis and Visualization of Citation Networks. Morgan & Claypool Publishers. ISBN 978-1-60845-939-1.
  4. Kas, Miray. Structures and Statistics of Citation Networks.
  5. Życzkowski, Karol (2010-10-01). "Citation graph, weighted impact factors and performance indices". Scientometrics. 85 (1): 301–315. arXiv:0904.2110. doi:10.1007/s11192-010-0208-6. ISSN 1588-2861. S2CID 7614954.
  6. Lu, Wangzhong; Janssen, J.; Milios, E.; Japkowicz, N.; Zhang, Yongzheng (2007-01-01). "Node similarity in the citation graph". Knowledge and Information Systems. 11 (1): 105–129. doi:10.1007/s10115-006-0023-9. ISSN 0219-3116. S2CID 26234247.
  7. Bolelli, Levent; Ertekin, Seyda; Giles, C. Lee (2006). "Clustering Scientific Literature Using Sparse Citation Graph Analysis". In Fürnkranz, Johannes; Scheffer, Tobias; Spiliopoulou, Myra (eds.). Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science. Vol. 4213. Berlin, Heidelberg: Springer. pp. 30–41. doi:10.1007/11871637_8. ISBN 978-3-540-46048-0. S2CID 15527080.
  8. Hummon, Norman P.; Doreian, Patrick (1989-03-01). "Connectivity in a citation network: The development of DNA theory". Social Networks. 11 (1): 39–63. doi:10.1016/0378-8733(89)90017-8. ISSN 0378-8733.
  9. Liu, Hanwen; Kou, Huaizhen; Yan, Chao; Qi, Lianyong (2020-04-24). "Keywords-Driven and Popularity-Aware Paper Recommendation Based on Undirected Paper Citation Graph". Complexity. 2020: e2085638. doi:10.1155/2020/2085638. ISSN 1076-2787.
  10. Sadeghian, Ali; Sundaram, Laksshman; Wang, Daisy Zhe; Hamilton, William F.; Branting, Karl; Pfeifer, Craig (June 2018). "Automatic semantic edge labeling over legal citation graphs". Artificial Intelligence and Law. 26 (2): 127–144. doi:10.1007/s10506-018-9217-1. ISSN 0924-8463. S2CID 254266762.
  11. Lauscher, Anne; Eckert, Kai; Galke, Lukas; Scherp, Ansgar; Rizvi, Syed Tahseen Raza; Ahmed, Sheraz; Dengel, Andreas; Zumstein, Philipp; Klein, Annette (2018-05-23). "Linked Open Citation Database". Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (PDF). JCDL '18. New York, NY, USA: Association for Computing Machinery. pp. 109–118. doi:10.1145/3197026.3197050. ISBN 978-1-4503-5178-2. S2CID 4902279.
