Co-citation is the frequency with which two documents are cited together by other documents. [1] If at least one other document cites two documents in common, these documents are said to be co-cited. The more co-citations two documents receive, the higher their co-citation strength, and the more likely it is that they are semantically related. [1] Like bibliographic coupling, co-citation is a semantic similarity measure for documents that makes use of citation analysis.
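The definition above can be sketched in a few lines of code. This is a minimal illustration using hypothetical document names and reference lists (the data mirrors the figure described below, where Documents C, D and E all cite A and B):

```python
from itertools import combinations
from collections import Counter

# Reference lists of citing documents (hypothetical example data).
references = {
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A", "B", "X"},
}

# For every pair of cited documents, count how many documents cite both.
# That count is the pair's co-citation strength.
co_citation = Counter()
for refs in references.values():
    for pair in combinations(sorted(refs), 2):
        co_citation[pair] += 1

print(co_citation[("A", "B")])  # → 3: A and B are co-cited by C, D and E
```

Any pair with a count of at least one is co-cited; higher counts indicate stronger (and, by assumption, more semantically related) pairs.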
The figure to the right illustrates the concept of co-citation and a more recent variation of co-citation which accounts for the placement of citations in the full text of documents. The figure's left image shows Documents A and B, which are both cited by Documents C, D and E; thus, Documents A and B have a co-citation strength, or co-citation index, [2] of three. This score is usually established using citation indexes. Documents featuring high numbers of co-citations are regarded as more similar. [1]
The figure's right image shows a citing document which cites Documents 1, 2 and 3. Documents 1 and 2, as well as Documents 2 and 3, each have a co-citation strength of one, given that each pair is cited together by exactly one other document. However, Documents 2 and 3 are cited in much closer proximity to each other in the citing document than Documents 1 and 2 are. To make co-citation a more meaningful measure in this case, a Co-Citation Proximity Index (CPI) can be introduced to account for the placement of citations relative to each other. Documents co-cited at greater relative distances in the full text receive lower CPI values. [3] Gipp and Beel were the first to propose using modified co-citation weights based on proximity. [4]
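A proximity-weighted variant can be sketched as follows. The specific weight values and the two-level position scheme here are illustrative assumptions, not the exact scheme of Gipp and Beel; the point is only that pairs cited in the same sentence contribute more to the CPI than pairs that merely share a document:

```python
from itertools import combinations
from collections import defaultdict

# Assumed weights: closer co-citations count more (values are illustrative).
WEIGHTS = {"sentence": 1.0, "paragraph": 0.5, "document": 0.25}

def closest_level(pos_a, pos_b):
    """pos = (paragraph index, sentence index) within the citing document."""
    if pos_a == pos_b:
        return "sentence"
    if pos_a[0] == pos_b[0]:
        return "paragraph"
    return "document"

# Citation positions in one hypothetical citing document, mirroring the
# figure: Documents 2 and 3 are cited in the same sentence, Document 1 far away.
positions = {"Doc1": (0, 0), "Doc2": (3, 1), "Doc3": (3, 1)}

cpi = defaultdict(float)
for (a, pos_a), (b, pos_b) in combinations(positions.items(), 2):
    cpi[(a, b)] += WEIGHTS[closest_level(pos_a, pos_b)]

print(cpi[("Doc2", "Doc3")])  # → 1.0  (co-cited in the same sentence)
print(cpi[("Doc1", "Doc2")])  # → 0.25 (co-cited in the same document only)
```

Both pairs have a plain co-citation strength of one, but the CPI separates them by citation placement.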
Henry Small [1] and Irina Marshakova [5] are credited with introducing co-citation analysis in 1973. [2] Both researchers arrived at the measure independently, although Marshakova received less credit, likely because her work was published in Russian. [6]
Co-citation analysis provides a forward-looking assessment of document similarity, in contrast to bibliographic coupling, which is retrospective. [7] The citations a paper receives in the future depend on the evolution of the academic field, so co-citation frequencies can still change. In the adjacent diagram, for example, Doc A and Doc B may still be co-cited by future documents, say Doc F and Doc G. This characteristic of co-citation allows for a more dynamic document classification system than bibliographic coupling.
Over the decades, researchers proposed variants or enhancements to the original co-citation concept. Howard White introduced author co-citation analysis in 1981. [8] Gipp and Beel proposed Co-citation Proximity Analysis (CPA) and introduced the CPI as an enhancement to the original co-citation concept in 2009. [3] Co-citation Proximity Analysis considers the proximity of citations within the full-texts for similarity computation and therefore allows for a more fine-grained assessment of semantic document similarity than pure co-citation. [9]
Authors' motivations for citing literature vary greatly and can arise for reasons other than simply referring to academically relevant documents. Cole and Cole expressed this concern based on the observation that scientists tend to cite friends and research colleagues more frequently, a partiality known as cronyism. [10] Additionally, it has been observed that academic works which have already gained much credit and reputation in a field tend to receive even more credit and thus citations in future literature, an observation termed the Matthew effect in science.
Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.
A citation is a reference to a source. More precisely, a citation is an abbreviated alphanumeric expression embedded in the body of an intellectual work that denotes an entry in the bibliographic references section of the work for the purpose of acknowledging the relevance of the works of others to the topic of discussion at the spot where the citation appears.
Scientific citation is providing detailed reference in a scientific publication, typically a paper or book, to previous published communications which have a bearing on the subject of the new publication. The purpose of citations in original work is to allow readers of the paper to refer to cited work to assist them in judging the new work, source background information vital for future development, and acknowledge the contributions of earlier workers. Citations in, say, a review paper bring together many sources, often recent, in one place.
In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Relevance may include concerns such as timeliness, authority or novelty of the result.
Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for production search applications.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
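The LSA pipeline described above (count matrix, SVD, cosine comparison) can be sketched with a toy term-document matrix. The vocabulary and documents are invented for illustration; two documents share "cat"/"pet" terms, while the third is about cars:

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Documents 0 and 1 share vocabulary; document 2 does not.
A = np.array([
    [2, 1, 0],   # "cat"
    [1, 2, 0],   # "pet"
    [0, 0, 3],   # "car"
], dtype=float)

# SVD; keeping the k largest singular values reduces the row (term)
# dimension while preserving the similarity structure among columns.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vecs[0], doc_vecs[1]))  # close to 1: very similar documents
print(cosine(doc_vecs[0], doc_vecs[2]))  # close to 0: very dissimilar documents
```

In the reduced space, the cat/pet documents end up nearly parallel while the car document stays close to orthogonal, matching the interpretation of cosine values given above.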
Bibliometrics is the use of statistical methods to analyse books, articles and other publications, especially with regard to scientific content. Bibliometric methods are frequently used in the field of library and information science. Bibliometrics is closely associated with scientometrics, the analysis of scientific metrics and indicators, to the point that both fields largely overlap.
Citation analysis is the examination of the frequency, patterns, and graphs of citations in documents. It uses the directed graph of citations — links from one document to another document — to reveal properties of the documents. A typical aim would be to identify the most important documents in a collection. A classic example is that of the citations between academic articles and books. For another example, judges of law support their judgements by referring back to judgements made in earlier cases. An additional example is provided by patents, which contain prior art citations to earlier patents relevant to the current claim. The digitization of patent data and increasing computing power have led to a community of practice that uses these citation data to measure innovation attributes, trace knowledge flows, and map innovation networks.
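The "directed graph of citations" and the aim of identifying important documents can be sketched directly. This toy example (document names are hypothetical) uses the simplest importance measure, the in-degree, i.e. how often each document is cited:

```python
from collections import Counter

# Directed citation edges: (citing document, cited document).
edges = [("C", "A"), ("D", "A"), ("E", "A"), ("E", "B"), ("D", "B")]

# In-degree in the citation graph = number of times a document is cited.
times_cited = Counter(cited for _, cited in edges)

print(times_cited.most_common(1)[0])  # → ('A', 3): A is the most-cited document
```

More elaborate analyses replace the in-degree with recursive measures (PageRank-style scores), but the underlying data structure is the same directed graph.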
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".
Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. Released in beta in November 2004, the Google Scholar index includes peer-reviewed online academic journals and books, conference papers, theses and dissertations, preprints, abstracts, technical reports, and other scholarly literature, including court opinions and patents.
Bibliographic coupling, like co-citation, is a similarity measure that uses citation analysis to establish a similarity relationship between documents. Bibliographic coupling occurs when two works reference a common third work in their bibliographies. It is an indication that a probability exists that the two works treat a related subject matter.
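Where co-citation counts shared *citing* documents, bibliographic coupling counts shared *references*. A minimal sketch with hypothetical reference lists:

```python
# Toy reference lists of two papers (hypothetical data).
refs = {
    "PaperA": {"W", "X", "Y"},
    "PaperB": {"X", "Y", "Z"},
}

# Coupling strength = number of works the two bibliographies share.
coupling_strength = len(refs["PaperA"] & refs["PaperB"])

print(coupling_strength)  # → 2: both papers cite X and Y
```

Note the contrast with the co-citation example above: coupling strength is fixed once both papers are published, whereas co-citation strength can keep growing as new citing documents appear.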
JabRef is an open-source, cross-platform citation and reference management software. It is used to collect, organize and search bibliographic information.
In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.
Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.
A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.
A citation graph, in information science and bibliometrics, is a directed graph that describes the citations within a collection of documents.
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
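The ESA construction (word as a tf-idf column over a concept corpus, document as the centroid of its word vectors) can be sketched with a tiny invented corpus. The three "concept articles" and their contents are assumptions standing in for Wikipedia articles:

```python
import math

# Tiny "knowledge base": each entry plays the role of one concept article.
corpus = {
    "Feline": "cat cat pet fur",
    "Automobile": "car engine wheel",
    "Dog": "dog pet fur",
}
concepts = list(corpus)
docs = [text.split() for text in corpus.values()]

def tfidf(word):
    """A word's vector: one tf-idf weight per concept article."""
    df = sum(word in d for d in docs)
    if df == 0:
        return [0.0] * len(docs)
    idf = math.log(len(docs) / df)
    return [d.count(word) * idf for d in docs]

def esa_vector(text):
    """A document's vector: the centroid of its words' tf-idf vectors."""
    words = text.split()
    cols = [tfidf(w) for w in words]
    return [sum(col[i] for col in cols) / len(words) for i in range(len(docs))]

v = esa_vector("cat fur")
print(concepts[v.index(max(v))])  # → Feline: the dominant concept
```

Comparing two texts then reduces to comparing their concept vectors, e.g. by cosine similarity, just as in LSA, but with interpretable dimensions.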
Co-citation Proximity Analysis (CPA) is a document similarity measure that uses citation analysis to assess semantic similarity between documents at both the global document level and the individual section level. The similarity measure builds on the co-citation analysis approach, but differs in that it exploits the information implied in the placement of citations within the full texts of documents.
In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN) is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking is different from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but it does not identify which specific entity it is.