Bibliographic coupling

Bibliographic coupling, like co-citation, is a similarity measure that uses citation analysis to establish a similarity relationship between documents. Bibliographic coupling occurs when two works reference a common third work in their bibliographies; this indicates that the two works are likely to treat related subject matter. [1]

Two documents are bibliographically coupled if they both cite one or more documents in common. The coupling strength of two given documents increases with the number of citations to other documents that they share. For example, suppose documents A and B both cite documents C, D and E. Documents A and B then have a bibliographic coupling strength of 3, the number of elements in the intersection of their two reference lists.
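As a minimal sketch, coupling strength can be computed as the size of the intersection of two reference sets. The document names and reference lists below are illustrative, mirroring the example above:

```python
# Illustrative reference lists, mirroring the A/B/C/D/E example above.
references = {
    "A": {"C", "D", "E"},
    "B": {"C", "D", "E"},
}

def coupling_strength(doc1, doc2, refs):
    """Bibliographic coupling strength: the number of references
    the two documents' reference lists have in common."""
    return len(refs[doc1] & refs[doc2])

print(coupling_strength("A", "B", references))  # -> 3
```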

Similarly, two authors are bibliographically coupled if the cumulative reference lists of their respective oeuvres each contain a reference to a common document, and their coupling strength likewise increases with the number of references that they share. If the cumulative reference list of an author's oeuvre is taken to be the multiset union of the reference lists of the documents that the author has co-authored, then the author bibliographic coupling strength of two authors (or, more precisely, of their oeuvres) is defined as the size of the multiset intersection of their cumulative reference lists. [2]
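A minimal sketch of this author-level definition, using Python's Counter as a multiset; the author names and reference data are hypothetical:

```python
from collections import Counter

# Hypothetical oeuvres: each author's papers, given as reference lists.
oeuvres = {
    "author1": [["C", "D"], ["D", "E"]],
    "author2": [["C", "E"], ["D", "E"]],
}

def cumulative_references(author):
    """Cumulative reference list of an oeuvre as a multiset:
    a document cited in several papers is counted several times."""
    refs = Counter()
    for paper_refs in oeuvres[author]:
        refs.update(paper_refs)
    return refs

def author_coupling_strength(a1, a2):
    """Size of the multiset intersection (& takes minimum counts)."""
    common = cumulative_references(a1) & cumulative_references(a2)
    return sum(common.values())

print(author_coupling_strength("author1", "author2"))  # -> 3
```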

Bibliographic coupling can be useful in a wide variety of fields, since it helps researchers find related research done in the past. By contrast, two documents are co-cited if they are both independently cited by one or more other documents.
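For contrast, a minimal sketch of co-citation counting over a hypothetical set of citing documents:

```python
# Hypothetical citing documents mapped to the documents they cite.
citing_docs = {
    "P": {"A", "B", "C"},
    "Q": {"A", "B"},
    "R": {"A", "D"},
}

def cocitation_strength(d1, d2):
    """Number of documents that cite both d1 and d2."""
    return sum(1 for cited in citing_docs.values()
               if d1 in cited and d2 in cited)

print(cocitation_strength("A", "B"))  # -> 2
```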

History

The concept of bibliographic coupling was introduced by M. M. Kessler of MIT in a paper published in 1963, [3] and has been embraced in the work of the information scientist Eugene Garfield. [4] It is one of the earliest citation analysis methods for computing document similarity. Some have questioned its usefulness, pointing out that two works may cite a common third work for completely unrelated reasons. Furthermore, bibliographic coupling is a retrospective similarity measure, [5] meaning the information used to establish the similarity relationship between documents lies in the past and is static: bibliographic coupling strength cannot change over time, since outgoing citation counts are fixed once a document is published.

The co-citation analysis approach introduced by Henry Small and published in 1973 addressed this shortcoming of bibliographic coupling by considering a document's incoming citations to assess similarity, a measure that can change over time. Additionally, the co-citation measure reflects the opinion of many authors and thus represents a better indicator of subject similarity. [6]

In 1972 Robert Amsler published a paper [7] describing a measure for determining subject similarity between two documents by fusing bibliographic coupling and co-citation analysis. [8]

In 1981 Howard White and Belver Griffith introduced author co-citation analysis (ACA). [9] Not until 2008 did Dangzhi Zhao and Andreas Strotmann combine White and Griffith's approach with Kessler's bibliographic coupling to define author bibliographic coupling analysis (ABCA), noting that, as long as authors remain active, the metric is not static, and that it is particularly useful when combined with ACA. [2]

More recently, in 2009, Gipp and Beel introduced a new approach termed Co-citation Proximity Analysis (CPA). CPA is based on the concept of co-citation but refines Small's measure by additionally considering the placement and proximity of citations within a document's full text. The assumption is that citations appearing in closer proximity to one another are more likely to indicate a stronger similarity relationship. [10]
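A rough sketch of the proximity-weighting idea behind CPA; the weight values below are illustrative assumptions, not the exact scheme from the paper:

```python
# Illustrative proximity weights (assumed, not from the paper):
# co-citations in the same sentence count more than those merely
# sharing a paragraph or section.
PROXIMITY_WEIGHTS = {"sentence": 1.0, "paragraph": 0.5, "section": 0.25}

def cpa_strength(cocitation_events):
    """Sum proximity weights over all co-citation events observed
    for one document pair across the citing full texts."""
    return sum(PROXIMITY_WEIGHTS[level] for level in cocitation_events)

# Documents X and Y are co-cited twice: once within one sentence,
# once merely within the same section of another paper.
print(cpa_strength(["sentence", "section"]))  # -> 1.25
```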

In summary, a chronological overview of citation analysis methods includes:

  1963 – Bibliographic coupling (Kessler)
  1972 – Fused bibliographic coupling/co-citation measure (Amsler)
  1973 – Co-citation analysis (Small)
  1981 – Author co-citation analysis, ACA (White and Griffith)
  2008 – Author bibliographic coupling analysis, ABCA (Zhao and Strotmann)
  2009 – Co-citation Proximity Analysis, CPA (Gipp and Beel)

Applications

Online sites that make use of bibliographic coupling include The Collection of Computer Science Bibliographies and CiteSeer.IST.

Notes

  1. Martyn, J. (1964). "Bibliographic coupling". Journal of Documentation. 20 (4): 236. doi:10.1108/eb026352.
  2. Zhao, D.; Strotmann, A. (2008). "Evolution of research activities and intellectual influences in information science 1996–2005: Introducing author bibliographic-coupling analysis". Journal of the American Society for Information Science and Technology. 59 (13): 2070–2086. doi:10.1002/asi.20910.
  3. Kessler, M. M. (1963). "Bibliographic coupling between scientific papers". American Documentation. 14 (1): 10–25. doi:10.1002/asi.5090140103.
  4. See for example Garfield, Eugene. "Multiple Independent Discovery and Creativity in Science". Current Contents, Nov. 3, 1980, pp. 5–10; reprinted in Essays of an Information Scientist, vol. 4 (1979–80), pp. 660–665.
  5. Garfield, Eugene (2001). From Bibliographic Coupling to Co-Citation Analysis via Algorithmic Historio-Bibliography. Presented at Drexel University, Philadelphia, PA.
  6. Small, Henry (1973). "Co-citation in the scientific literature: A new measure of the relationship between two documents". Journal of the American Society for Information Science. 24 (4): 265–269. doi:10.1002/asi.4630240406.
  7. Amsler, Robert (December 1972). "Applications of citation-based automatic classification". Technical Report 72-14, Linguistics Research Center, University of Texas at Austin.
  8. Amsler, a software class written by Bruno Martins and developed by the XLDB group of the Department of Informatics, Faculty of Sciences, University of Lisbon, Portugal.
  9. White, Howard D.; Griffith, Belver C. (1981). "Author Cocitation: A Literature Measure of Intellectual Structure". Journal of the American Society for Information Science. 32 (3): 163–171. doi:10.1002/asi.4630320302.
  10. Gipp, Bela; Beel, Joeran (2009). "Citation Proximity Analysis (CPA) – A New Approach for Identifying Related Work Based on Co-Citation Analysis". Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI'09), Rio de Janeiro, Brazil, pp. 571–575.

Further reading

For an interesting summary of the progression of the study of citations, see the account by Small. [1] The paper is more a memoir than a research paper, filled with decisions, research expectations, interests and motivations, including the story of how Henry Small approached Belver Griffith with the idea of co-citation and how they became collaborators, mapping science as a whole.

  1. Small, Henry (2001). "Belver and Henry". Scientometrics. 51 (3): 489–497. doi:10.1023/a:1019690918490. S2CID 5962665.
  2. Gipp, Bela; Meuschke, Norman; Lipinski, Mario (2015). "CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central". Proceedings of the iConference 2015, Newport Beach, California.