Bibliogram


A bibliogram is a graphical representation of the frequency of certain target words, usually noun phrases, in a given text. The term was introduced in 2005 by Howard D. White to name the linguistic object studied, but not previously named, in informetrics, scientometrics and bibliometrics. The noun phrases in the ranking may be authors, journals, subject headings, or other indexing terms. The "stretches of text" may be a book, a set of related articles, a subject bibliography, a set of Web pages, and so on. Bibliograms are always generated from writings, usually from scholarly or scientific literature.


Definition

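A bibliogram is a verbal construct made when noun phrases from extended stretches of text are ranked high to low by their frequency of co-occurrence with one or more user-supplied seed terms. Each bibliogram has three components: a seed term that sets the context, the noun phrases that co-occur with the seed, and the frequency counts by which those phrases are ranked.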

As a family of term-frequency distributions, the bibliogram has frequently been written about under descriptions such as:

It is sometimes called a "core and scatter" distribution. The "core" consists of relatively few top-ranked terms that account for a disproportionately large share of co-occurrences overall.

The "scatter" consists of relatively many lower-ranked terms that account for the remaining share of co-occurrences. Usually the top-ranked terms are not tied in frequency, but identical frequencies and tied ranks become more common as the frequencies get smaller. At the bottom of the distribution, a long tail of terms are tied in rank because each co-occurs with the seed term only once.
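The core, the scatter, and the tied ranks can be made concrete with a small sketch. The Python below is illustrative only: the term list is invented, the function names `rank_with_ties` and `core_share` are hypothetical, and the choice of where the "core" ends (`core_size`) is an assumption left to the analyst rather than something the definition fixes.

```python
from itertools import groupby


def rank_with_ties(ranking):
    """Give terms with identical frequencies the same rank.

    `ranking` is a list of (term, frequency) pairs already sorted high to low.
    Uses "competition" ranking (1, 2, 2, 4), one of several tie conventions.
    """
    ranked = []
    position = 1
    for freq, group in groupby(ranking, key=lambda item: item[1]):
        terms = [term for term, _ in group]
        ranked.extend((position, term, freq) for term in terms)
        position += len(terms)  # skip the ranks used up by the tied terms
    return ranked


def core_share(ranking, core_size):
    """Fraction of all co-occurrences held by the top `core_size` terms."""
    total = sum(freq for _, freq in ranking)
    core = sum(freq for _, freq in ranking[:core_size])
    return core / total if total else 0.0


# Invented frequencies, for illustration only.
sample = [
    ("alpha", 6), ("beta", 4), ("gamma", 2), ("delta", 2),
    ("epsilon", 1), ("zeta", 1), ("eta", 1),
]

for rank, term, freq in rank_with_ties(sample):
    print(rank, freq, term)

print(f"share held by top 2 terms: {core_share(sample, 2):.2f}")
```

In this invented sample the two top terms are untied and hold more than half of all co-occurrences, while the three terms at the bottom share a single rank because each occurs once.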

In most cases bibliograms can be described by power laws such as Zipf's law and Bradford's law. In this regard, they have long been studied by mathematicians and statisticians in information science. However, these treatments typically ignore the qualitative meanings of the ranked terms themselves, which are often of interest in their own right. For example, the following bibliogram was made with an author's name as seed and shows the descriptors that co-occur with her name in the ERIC database. The descriptors are ranked by how many of her articles they were used to index:

6   Creativity
4   Creativity Tests
3   Divergent Thinking
2   Elementary School Mathematics
2   Instruction
2   Mathematics Education
2   Problem Solving
2   Research
2   Time
1   Acceleration
1   Anxiety
1   Beginning Teachers
1   Behavioral Objectives
1   Child Development
1   Classroom Techniques
1   Cognitive Development
etc.

This author is a researcher in education, and it will be seen that the terms profile her intellectual interests over the years. In general, bibliograms can be used to:

Bibliograms can be created with the RANK command on Dialog (other vendors have similar commands), ranking options within WorldCat, HistCite, Google Scholar, and inexpensive content analysis software.
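Outside such tools, the ranking itself is straightforward to compute. The following is a minimal sketch in Python, not a description of how any of the services above work: it assumes each bibliographic record is a dictionary with an `author` list and a `descriptors` list (hypothetical field names), and the records shown are invented for illustration.

```python
from collections import Counter


def bibliogram(records, seed, seed_field="author", term_field="descriptors"):
    """Rank terms by how many records they share with the seed, high to low."""
    counts = Counter()
    for record in records:
        # Only records that contain the seed contribute co-occurrences.
        if seed in record[seed_field]:
            # set() so a term repeated within one record is counted once.
            counts.update(set(record[term_field]))
    # Sort by descending frequency, then alphabetically so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical records, invented for illustration only.
records = [
    {"author": ["A. Author"], "descriptors": ["Creativity", "Creativity Tests"]},
    {"author": ["A. Author"], "descriptors": ["Creativity", "Divergent Thinking"]},
    {"author": ["A. Author", "B. Writer"], "descriptors": ["Creativity", "Problem Solving"]},
    {"author": ["B. Writer"], "descriptors": ["Anxiety", "Creativity"]},
]

ranking = bibliogram(records, "A. Author")
for term, freq in ranking:
    print(f"{freq}   {term}")
```

The seed here is an author name and the ranked terms are descriptors, but the same function applies to any pair of fields, for example a journal as seed and authors as ranked terms.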

White suggests that bibliograms have a parallel construct in what he calls associograms. These are the rank-ordered lists of word association norms studied in psycholinguistics. They are similar to bibliograms in statistical structure but are not generated from writings. Rather, they are generated by presenting panels of people with a stimulus term (which functions like a seed term) and tabulating the words they associate with the seed by frequency of co-occurrence. They are currently of interest to information scientists as a nonstandard way of creating thesauri for document retrieval.

Examples

Other examples of bibliograms are the ordered set of an author's co-authors, or the list of authors who have published in a specific journal together with their number of articles. A popular example is the list of additional titles to consider for purchase that appears when you search for an item on Amazon. These suggested titles are the top terms in the "core" of a bibliogram formed with your search term as seed. The frequencies are counts of the times they have been co-purchased with the seed.

Examples of associograms may be found in the Edinburgh Associative Thesaurus.

Other methods

Related methods are used in data clustering and data mining. Google Sets also created a list of associated terms for a given set of terms.


