Mass digitization

Mass digitization is a term used to describe "large-scale digitization projects of varying scopes." Such projects include efforts to digitize physical books on a mass scale in order to make knowledge openly and publicly accessible, and they are made possible by selecting cultural objects, preparing them, scanning them, and constructing the necessary digital infrastructure, including digital libraries. These projects are often led by cultural institutions and private bodies, although individuals may also undertake mass digitization efforts. Mass digitization occurs constantly; millions of files (books, photographs, color swatches, etc.) are uploaded to large-scale public or private online archives every day. This movement of the physical into the digital realm on a mass scale changes the way we interact with knowledge. The history of mass digitization can be traced as early as the mid-1800s with the advent of microfilm, and technical infrastructures such as the internet, server farms, and computer data storage make these efforts technologically possible. This seemingly simple process of digitizing physical knowledge, or even products, has vast implications.

History of Mass Digitization Initiatives

Fictional Considerations

Perhaps the most notable fictional consideration of mass digitization is Jorge Luis Borges's speculation on the Library of Babel. In this account, Borges describes a vision of a library in which every possible permutation of books is available. [1] Although Borges imagined the preservation and archiving of all knowledge in a physical space (a library), his fictional vision has already taken place in a digital sense: endless copies of online books are freely available to the public by means of internet archives and library databases. Accounts like this were actually quite common, and they convey the idea that "the dream and practice of mass digitization cultural works have been around for decades." [2]

Non-fictional considerations

Some of the earliest digitization programs started before the age of the internet and include the adoption of technologies such as microfilm in the 19th century. The technical affordances of microfilm allowed it to become a significant medium in efforts to preserve and extend library materials, as well as in "graphically dramatizing questions of scale." Microfilm, also known as microphotography, was developed in 1839, and its capabilities demonstrated (perhaps for the first time) the ability to store mass amounts of information, in this case photographs, in a physically small space. Discussing the affordances of microfilm, one observer noted that "the whole archives of the nation might be packed away in a snuffbox." Such remarks demonstrate how the technical infrastructure of microfilm could be leveraged to archive and preserve on a mass scale. Paul Otlet, a Belgian author often considered one of the founders of information science, "outlined the benefits of microfilm as a stable and long-term remediation format that could be used to extend the reach of literature" in his 1906 work "Sur une forme nouvelle du livre : le livre microphotographique". His claim was borne out when the Library of Congress and other bodies used microfilm to "digitize" cultural objects such as manuscripts, books, images, and newspapers in the early 20th century.

Technical Infrastructures

Microfilm

Microfilm represents a shift in the infrastructure of data storage: an immense number of images could be stored in a physically small space and then enlarged for viewing with the help of a microfilm machine. Microfilm, in combination with the microfilm viewer, was leveraged to allow objects to be digitized, preserved, and viewed on a mass scale. It is worth noting that students once needed the help of library staff before using the machine, whereas accessing digital materials today is a swift, easy process that one can carry out independently. More information on microfilm can be found under the "Non-fictional considerations" section of this page.

Server Farms

Another large shift in the infrastructure of data storage was the advent of server farms. Websites rely on server farms for “scalability, reliability, and low-latency access to Internet content”. According to Burns, these technologies are essential when building a high-performance infrastructure for content delivery. The move from microfilm to complex server farms with their own schemas demonstrates the growing infrastructural demands of mass digitization over time: server farms both facilitate mass digitization and house its results. Without server farms, data could not be stored or accessed at the scale that mass digitization projects require. However, server farms do not act alone in storing data. Other storage infrastructures, such as hard drives on personal computers, also contribute, and encryption tools and services work to protect and secure data in sensitive, or internal-use, mass digitization projects.

Databases

Databases are often seen as the "home" of a variety of mass digitization efforts. Databases such as Google Books allow one to view an entire collection of digitized objects. In the case of Google Books, the database allows a user to search, research, and preview an estimated 40 million titles that the Google team has scanned and uploaded, corresponding to roughly 30% of the estimated number of all books ever published. However, faults do exist within such databases; the hands of a person operating a scanner can accidentally be scanned and posted in place of the page of a book itself. Errors such as these in public, and often permanent, databases call into question the reliability of human efforts in mass digitization projects.

Other databases allow researchers from all over the world to upload or view data for scientific inquiry. In this case, raw data from scientific experiments, anonymized for participant privacy, is uploaded and stored on a mass scale. A prime example of such a database for research purposes is the Child Language Data Exchange System (CHILDES). This database houses raw data on language acquisition, including videos, audio, transcripts, and de-identified participant information. Databases that store published research articles also exist, and include sites such as PubMed, ScienceDirect, JSTOR, and EBSCO.

Databases, in conjunction with server farms and other web-based infrastructures, allow for crucial collaboration in the scientific realm. Here, mass digitization has expanded from the digitization of physical objects (such as books) to the digitization of interactions for scientific inquiry.

Implications

Related Research Articles

Database: Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

Brewster Kahle: American computer engineer, founder of the Internet Archive

Brewster Lurton Kahle is an American digital librarian, computer engineer, Internet entrepreneur, and advocate of universal access to all knowledge. In 1996, Kahle founded the Internet Archive and co-founded Alexa Internet. In 2012, he was inducted into the Internet Hall of Fame.

Digitization: Converting information into digital form

Digitization is the process of converting information into a digital format. The result is the representation of an object, image, sound, document, or signal obtained by generating a series of numbers that describe a discrete set of points or samples. The result is called digital representation or, more specifically, a digital image, for the object, and digital form, for the signal. In modern practice, the digitized data is in the form of binary numbers, which facilitates processing by digital computers and other operations, but digitizing simply means "the conversion of analog source material into a numerical format"; the decimal or any other number system can be used instead.

In library and archival science, digital preservation is a formal process to ensure that digital information of continuing value remains accessible and usable in the long term. It involves planning, resource allocation, and application of preservation methods and technologies, and combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and, possibly, enforcing processes for its creation, approval, and distribution. Systems using ECM generally provide a secure repository for managed items, analog or digital. They also include one or more methods for importing content to manage new items, and several presentation methods to make items available for use. Although ECM content may be protected by digital rights management (DRM), it is not required. ECM is distinguished from general content management by its cognizance of the processes and procedures of the enterprise for which it is created.

Microform: Forms with microreproductions of documents

A microform is a scaled-down reproduction of a document, typically on either photographic film or paper, made for the purposes of transmission, storage, reading, and printing. Microform images are commonly reduced to about 4%, or 1/25, of the original document size. For special purposes, greater optical reductions may be used.

Mundaneum: Institution founded in 1910 that aimed to gather together all the world's knowledge

The Mundaneum was an institution which aimed to gather together all the world's knowledge and classify it according to a system called the Universal Decimal Classification. It was developed at the turn of the 20th century by Belgian lawyers Paul Otlet and Henri La Fontaine. The Mundaneum has been identified as a milestone in the history of data collection and management, and as a precursor to the Internet.

The California Digital Library (CDL) was founded by the University of California in 1997. Under the leadership of then UC President Richard C. Atkinson, the CDL's original mission was to forge a better system for scholarly information management and improved support for teaching and research. In collaboration with the ten University of California Libraries and other partners, CDL assembled one of the world's largest digital research libraries. CDL facilitates the licensing of online materials and develops shared services used throughout the UC system. Building on the foundations of the Melvyl Catalog, CDL has developed one of the largest online library catalogs in the country and works in partnership with the UC campuses to bring the treasures of California's libraries, museums, and cultural heritage organizations to the world. CDL continues to explore how services such as digital curation, scholarly publishing, archiving and preservation support research throughout the information lifecycle.

The Brittle Books Program is an initiative carried out by the National Endowment for the Humanities at the request of the United States Congress. The initiative began officially between 1988 and 1989 with the intention of eventually microfilming over 3 million endangered volumes.

Oral history preservation is the field that deals with the care and upkeep of oral history materials, whatever format they may be in. Oral history is a method of historical documentation, using interviews with living survivors of the time being investigated. Oral history often touches on topics scarcely touched on by written documents, and by doing so, fills in the gaps of records that make up early historical documents.

Cloud storage is a model of computer data storage in which data, said to be on "the cloud", is stored remotely in logical pools and is accessible to users over a network, typically the Internet. The physical storage spans multiple servers, and the physical environment is typically owned and managed by a cloud computing provider. These cloud storage providers are responsible for keeping the data available and accessible, and the physical environment secured, protected, and running. People and organizations buy or lease storage capacity from the providers to store user, organization, or application data.

Google App Engine is a cloud computing platform as a service for developing and hosting web applications in Google-managed data centers. Applications are sandboxed and run across multiple servers. App Engine offers automatic scaling for web applications—as the number of requests increases for an application, App Engine automatically allocates more resources for the web application to handle the additional demand.

A digital library, also called an online library, an internet library, a digital repository, a library without walls, or a digital collection, is an online database of digital objects that can include text, still images, audio, video, digital documents, or other digital media formats or a library accessible through the internet. Objects can consist of digitized content like print or photographs, as well as originally produced digital content like word processor files or social media posts. In addition to storing content, digital libraries provide means for organizing, searching, and retrieving the content contained in the collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals or organizations. The digital content may be stored locally, or accessed remotely via computer networks. These information retrieval systems are able to exchange information with each other through interoperability and sustainability.

History of hypertext

Hypertext is text displayed on a computer or other electronic device with references (hyperlinks) to other text that the reader can immediately access, usually by a mouse click or keypress sequence. Early conceptions of hypertext defined it as text that could be connected by a linking system to a range of other documents that were stored outside that text. In 1934, Belgian bibliographer Paul Otlet developed a blueprint for links that telescoped out from hypertext electrically to allow readers to access documents, books, photographs, and so on, stored anywhere in the world.

Robert Goldschmidt

Robert B. Goldschmidt (1877–1935) was a Belgian chemist, physicist, and engineer who first proposed the idea of standardized microfiche (microfilm).

The following is provided as an overview of and topical guide to databases.

Digital object memory

A digital object memory (DOMe) is a digital storage space intended to keep permanently all related information about a concrete physical object instance that is collected during the lifespan of this object and thus forms a basic building block for the Internet of Things (IoT) by connecting digital information with physical objects.

Elliptics is an open-source distributed key–value data store. By default it is a classic distributed hash table (DHT) with multiple replicas placed in different groups. Elliptics was created to meet the requirements of multi-datacenter and physically distributed storage locations when storing huge amounts of medium and large files.

References

  1. Borges, Jorge Luis (2001). Prólogos de La biblioteca de Babel. Madrid: Alianza Editorial. ISBN 84-206-3875-7. OCLC 57893246.
  2. Thylstrup, Nanna Bonde (2019). The Politics of Mass Digitization. Cambridge, MA: MIT Press. ISBN 978-0-262-35005-1. OCLC 1078691226.