Bibliomining is the use of a combination of data mining, data warehousing, and bibliometrics for the purpose of analyzing library services. [1] [2] The term was created in 2003 by Scott Nicholson, Assistant Professor, Syracuse University School of Information Studies, in order to distinguish data mining in a library setting from other types of data mining. [3]
First, a data warehouse must be created. This is done by compiling information on the resources, such as titles, authors, subject headings, and descriptions of the collections. Then the demographic surrogate information is organized. Finally, the library information (such as the librarian involved, whether the information came from the reference desk or the circulation desk, and the location of the library) is obtained.
Once the warehouse is organized, the data can be processed and analyzed. Several methods are available, such as online analytical processing (OLAP), data mining software, and data visualization.
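The three groups of data described above, and a simple OLAP-style roll-up over them, can be sketched as follows. This is a minimal illustration rather than an established bibliomining schema: the table names, column names, and sample rows are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# In-memory database standing in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (resource_id INTEGER PRIMARY KEY, title TEXT, author TEXT, subject TEXT);
CREATE TABLE surrogate (surrogate_id INTEGER PRIMARY KEY, age_group TEXT, patron_type TEXT);
CREATE TABLE service_point (point_id INTEGER PRIMARY KEY, desk TEXT, branch TEXT);
CREATE TABLE checkout (resource_id INTEGER, surrogate_id INTEGER, point_id INTEGER, hour INTEGER);
""")

# Made-up sample rows: resource data, demographic surrogates, library data.
conn.executemany("INSERT INTO resource VALUES (?, ?, ?, ?)", [
    (1, "Intro to Statistics", "A. Author", "Mathematics"),
    (2, "Garden Basics", "B. Writer", "Gardening"),
])
conn.executemany("INSERT INTO surrogate VALUES (?, ?, ?)", [
    (1, "18-24", "student"),
    (2, "45-54", "faculty"),
])
conn.executemany("INSERT INTO service_point VALUES (?, ?, ?)", [
    (1, "circulation", "Main"),
    (2, "reference", "Main"),
])
conn.executemany("INSERT INTO checkout VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 10), (1, 1, 1, 14), (1, 2, 1, 14), (2, 2, 2, 14),
])

# Roll-up: number of checkouts per subject and patron type.
rows = conn.execute("""
    SELECT r.subject, s.patron_type, COUNT(*) AS n
    FROM checkout c
    JOIN resource r ON r.resource_id = c.resource_id
    JOIN surrogate s ON s.surrogate_id = c.surrogate_id
    GROUP BY r.subject, s.patron_type
    ORDER BY r.subject, s.patron_type
""").fetchall()

for subject, patron_type, n in rows:
    print(subject, patron_type, n)
```

Note that the checkout table refers only to surrogate identifiers, so the analysis never touches a patron's name or card number.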
Bibliomining is used to discover patterns in what people are reading and researching, which allows librarians to target their community better. Bibliomining can also help library directors focus their budgets on resources that will be utilized. Another use is to determine when people use the library most often, so that staffing needs can be adequately met. Combining bibliomining with other research techniques, such as focus groups, surveys, and cost-benefit analysis, will help librarians get a better picture of their patrons and their needs.
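The staffing use case can be illustrated with a small sketch that counts transactions per hour of the day. The timestamps are invented sample data, and `visits_by_hour` is a hypothetical helper, not part of any library system.

```python
from collections import Counter
from datetime import datetime

# Invented transaction timestamps in ISO 8601 format (sample data only).
timestamps = [
    "2024-03-04T10:15:00",
    "2024-03-04T14:05:00",
    "2024-03-04T14:30:00",
    "2024-03-04T14:55:00",
    "2024-03-04T16:20:00",
]

def visits_by_hour(stamps):
    """Count how many transactions fall in each hour of the day."""
    return Counter(datetime.fromisoformat(s).hour for s in stamps)

counts = visits_by_hour(timestamps)
busiest_hour, busiest_count = counts.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00 with {busiest_count} transactions")
```

In practice the counts would be computed over weeks or months of data and broken down by weekday before being used to plan a schedule.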
There is some concern that data mining violates patron privacy. This can be addressed during extraction: all personally identifiable information is removed as the data is loaded, so the data warehouse itself is clean. The original patron data can then be deleted, leaving no way to link the warehoused data to a particular patron. This can be done in a few ways. One, used with information regarding database access, is to record the IP address but then replace it with a substitute code that still distinguishes one source from another without revealing who it is. Another is to record an item's return to the library and create a "demographic surrogate" of the patron. The demographic surrogate contains no identifying information such as names, library card numbers, or addresses.
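Both approaches can be sketched in a few lines. This is only one possible reading of the text: a keyed hash (HMAC-SHA256) is assumed here as the way to replace an IP address with a code, and the field names in the patron record are hypothetical. If the key is discarded after loading, the codes cannot be recomputed from known IP addresses.

```python
import hashlib
import hmac
import secrets

def make_pseudonymizer(key: bytes):
    """Return a function that maps an IP address to a stable, opaque code."""
    def pseudonymize(ip: str) -> str:
        digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
        return digest[:16]  # shortened for readability (assumption)
    return pseudonymize

def demographic_surrogate(patron: dict) -> dict:
    """Keep only non-identifying fields; the field names are hypothetical."""
    kept = ("age_group", "patron_type", "department")
    return {k: patron[k] for k in kept if k in patron}

# Random key used for one load; discarding it afterwards prevents re-linking.
key = secrets.token_bytes(32)
pseudonymize = make_pseudonymizer(key)

patron = {
    "name": "Example Patron",
    "card_number": "0000000",
    "address": "1 Example Street",
    "age_group": "18-24",
    "patron_type": "student",
}

print(pseudonymize("192.0.2.10"))     # same input always gives the same code
print(demographic_surrogate(patron))  # no name, card number, or address
```

The same IP address always maps to the same code, so repeat use of a database can still be counted, while the surrogate record carries only the coarse demographic fields.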
The other concern in bibliomining is that it only provides data in a very detached manner. Information is given as to how a patron uses library resources, but there is no way to track whether the resources met the user's needs completely. Someone could take out a book on a topic but not find the information they were seeking. Bibliomining only helps identify which books are used, not how useful they actually were, so it cannot provide information on how well a collection serves a patron. To counteract this, bibliomining must be used in conjunction with other research techniques.
In computing, a data warehouse (DW), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in a single place, where it is used to create analytical reports for workers throughout the enterprise.
A library is a collection of materials or media that are accessible for use and not just for display. It provides physical or digital access to material, and may be a physical location or a virtual space, or both. A library's collection can include printed materials and other physical resources in many formats such as DVDs, as well as access to information, music or other content held on bibliographic databases.
The reference desk or information desk of a library is a public service counter where professional librarians provide library users with direction to library materials, advice on library collections and services, and expertise on multiple kinds of information from multiple sources.
Data mining is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Information privacy is the relationship between the collection and dissemination of data, technology, the public expectation of privacy, and the legal and political issues surrounding them. It is also known as data privacy or data protection.
In computing, extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s). The ETL process became a popular concept in the 1970s and is often used in data warehousing.
A librarian is a person who works professionally in a library, providing access to information, and sometimes social or technical programming, or instruction on information literacy to users.
Web mining is the application of data mining techniques to discover patterns from the World Wide Web. It uses automated methods to extract both structured and unstructured data from web pages, server logs and link structures. There are three main sub-categories of web mining. Web content mining extracts information from within a page. Web structure mining discovers the structure of the hyperlinks between documents, categorizing sets of web pages and measuring the similarity and relationship between different sites. Web usage mining finds patterns of usage of web pages.
Bibliometrics is the use of statistical methods to analyse books, articles and other publications. Bibliometric methods are frequently used in the field of library and information science. The sub-field of bibliometrics which concerns itself with the analysis of scientific publications is called scientometrics. Citation analysis is a commonly used bibliometric method which is based on constructing the citation graph, a network or graph representation of the citations between documents. Many research fields use bibliometric methods to explore the impact of their field, the impact of a set of researchers, the impact of a particular paper, or to identify particularly impactful papers within a specific field of research. Bibliometrics also has a wide range of other applications, such as in descriptive linguistics, the development of thesauri, and evaluation of reader usage.
Digital reference is a service by which a library reference service is conducted online, and the reference transaction is a computer-mediated communication. It is the remote, computer-mediated delivery of reference information provided by library professionals to users who cannot access or do not want face-to-face communication. Virtual reference service is most often an extension of a library's existing reference service program. The word "reference" in this context refers to the task of providing assistance to library users in finding information, answering questions, and otherwise fulfilling users’ information needs. Reference work often but not always involves using reference works, such as dictionaries, encyclopedias, etc. This form of reference work expands reference services from the physical reference desk to a "virtual" reference desk where the patron could be writing from home, work or a variety of other locations.
Library collection development is the process of building the library materials to meet the information needs of the users in a timely and economical manner using information resources locally held, as well as from other organizations.
A reference interview is a conversation between a librarian and a library user, usually at a reference desk, in which the librarian responds to the user's initial explanation of his or her information need by first attempting to clarify that need and then by directing the user to appropriate information resources.
In library and archival science, preservation is a set of activities aimed at prolonging the life of a record, book, or object while making as few changes as possible. Preservation activities vary widely and may include monitoring the condition of items, maintaining the temperature and humidity in collection storage areas, writing a plan in case of emergencies, digitizing items, writing relevant metadata, and increasing accessibility. Preservation, in this definition, is practiced in a library or an archive by a librarian, archivist, or other professional when they perceive a record is in need of care.
Scholarly communication involves the creation, publication, dissemination and discovery of academic research, primarily in peer-reviewed journals and books. It is “the system through which research and other scholarly writings are created, evaluated for quality, disseminated to the scholarly community, and preserved for future use.” This primarily involves the publication of peer-reviewed academic journals, books and conference papers.
Customer analytics is a process by which data from customer behavior is used to help make key business decisions via market segmentation and predictive analytics. This information is used by businesses for direct marketing, site selection, and customer relationship management. Marketing provides services in order to satisfy customers. With that in mind, the productive system is considered from its beginning at the production level, to the end of the cycle at the consumer. Customer analytics plays an important role in the prediction of customer behavior.
Digital artifactual value, a preservation term, is the intrinsic value of a digital object, rather than the informational content of the object. Though standards are lacking, born-digital objects and digital representations of physical objects may have a value attributed to them as artifacts.
De-identification is the process used to prevent someone's personal identity from being revealed. For example, data produced during human subject research might be de-identified to preserve the privacy of research participants. Biological data may be de-identified in order to comply with HIPAA regulations that define and stipulate patient privacy laws.
Roving reference, also called roaming reference, is a library service model in which, instead of being positioned at a static reference desk, a librarian moves throughout the library to locate patrons with questions or concerns and offer them help in finding or using library resources.
Privacy in education refers to the broad area of ideologies, practices, and legislation that involve the privacy rights of individuals in the education system. Concepts that are commonly associated with privacy in education include the expectation of privacy, the Family Educational Rights and Privacy Act (FERPA), the Fourth Amendment, and the Health Insurance Portability and Accountability Act of 1996 (HIPAA). Most privacy in education concerns relate to the protection of student data and the privacy of medical records. Many scholars are engaging in an academic discussion that covers the scope of students’ privacy rights, from students in K-12 to those in higher education, and the management of student data in an age of rapid access and dissemination of information.