Newspaper digitization


Newspaper digitization is the process of converting old newspapers from analog form into digital images. The most common analog forms for old newspapers are paper and microfilm. Digitized images of newspaper pages are typically (though not always) analyzed with OCR software in order to produce text files of the newspaper content. Newspaper digitization is a special case of digitization in general.


Newspapers preserve a rich record of the past, and since the advent of digital media, many institutions across the world have begun to digitize them and make the digital files publicly available. However, over 90% of newspapers remained unscanned in 2015. [1] Digitized newspapers may be made available for free or for a fee. Several lists (noted below) try to catalog digitized newspapers worldwide.

Successful newspaper scanning is a complex activity. Although scanning from paper is possible, microfilm scanning is cheaper, and good microfilm has been called “the single most critical factor in the success of newspaper digitization.” [2] OCR analysis of scanned pages presents a number of technical challenges: the text of old newspapers is often difficult to read, which introduces recognition errors and complicates searching. Attaching metadata to images to make them more easily findable is another important step. Finally, search interfaces must be developed. A number of companies specialize in newspaper scanning and some produce software specially designed for the process.
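As a rough illustration of why OCR errors complicate searching, the following Python sketch compares an exact search with an approximate one over a short passage. The sample text, the function name `fuzzy_find`, and the 0.8 similarity cutoff are assumptions made for this example; it uses only the standard-library `difflib` module and is not drawn from any particular newspaper digitization system.

```python
import difflib
import re


def fuzzy_find(query, ocr_text, cutoff=0.8):
    """Return words in ocr_text whose similarity to query is at least cutoff."""
    # Split the OCR output into lowercase word tokens.
    words = re.findall(r"[a-z0-9]+", ocr_text.lower())
    # De-duplicate while keeping first-seen order.
    unique_words = list(dict.fromkeys(words))
    # get_close_matches ranks candidates by their SequenceMatcher ratio.
    return difflib.get_close_matches(query.lower(), unique_words, n=10, cutoff=cutoff)


# Hypothetical OCR output with typical misreads ("rn" for "m", "1" for "l").
sample = "The Governrnent announced a new rai1way line; the governor approved."

# An exact substring search misses the misread word entirely.
exact_hit = "government" in sample.lower()

# An approximate search can still recover it.
fuzzy_hits = fuzzy_find("government", sample)
```

Here `exact_hit` is false because the scanned word was misread, while `fuzzy_hits` contains the misread token. A lower cutoff would recover more damaged words at the cost of more false matches, so the threshold has to be tuned for a given collection.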

The cost of storing printed newspapers and the relatively low demand for originals after microfilming and scanning mean that printed newspapers, once microfilmed or scanned, have often been thrown out. Some people feel that this is a loss for researchers, or simply that there is a poignancy when the paper reading experience disappears. Author Nicholson Baker went so far as to create a paper newspaper archive, which he called the American Newspaper Repository, in order to preserve paper newspapers that would otherwise be discarded.

More recent newspapers may have been "born digital," meaning that they were printed from computer files rather than by letterpress or phototypesetting. They can be archived by storing the publisher's digital files of each page image rather than scanning the pages.

Finding aids and metasearch engines


Related Research Articles

Optical character recognition: Computer recognition of visual text

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.

Internet Archive: American nonprofit digital archive

The Internet Archive is an American nonprofit digital library founded in 1996 by Brewster Kahle. It provides free access to collections of digitized materials including websites, software applications, music, audiovisual and print materials. The Archive also advocates for a free and open Internet. As of February 4, 2024, the Internet Archive holds more than 44 million print materials, 10.6 million videos, 1 million software programs, 15 million audio files, 4.8 million images, 255,000 concerts, and over 835 billion web pages in its Wayback Machine. Its stated mission is to provide "universal access to all knowledge".

Image scanner: Device that optically scans images, printed text

An image scanner—often abbreviated to just scanner—is a device that optically scans images, printed text, handwriting or an object and converts it to a digital image. Commonly used in offices are variations of the desktop flatbed scanner where the document is placed on a glass window for scanning. Hand-held scanners, where the device is moved by hand, have evolved from text scanning "wands" to 3D scanners used for industrial design, reverse engineering, test and measurement, orthotics, gaming and other applications. Mechanically driven scanners that move the document are typically used for large-format documents, where a flatbed design would be impractical.

Digitization: Converting information into digital form

Digitization is the process of converting information into a digital format. The result is the representation of an object, image, sound, document, or signal obtained by generating a series of numbers that describe a discrete set of points or samples. The result is called digital representation or, more specifically, a digital image, for the object, and digital form, for the signal. In modern practice, the digitized data is in the form of binary numbers, which facilitates processing by digital computers and other operations, but digitizing simply means "the conversion of analog source material into a numerical format"; the decimal or any other number system can be used instead.

The National Digital Newspaper Program is a joint project between the National Endowment for the Humanities and the Library of Congress to create and maintain a publicly available, online digital archive of historically significant newspapers published in the United States between 1836 and 1922. Additionally, the program will make available bibliographic records and holdings information for some 140,000 newspaper titles from the 17th century to the present. Further, it will include scope notes and encyclopedia-style entries discussing the historical significance of specific newspapers. Added content will also include contextually relevant historical information. "One organization within each U.S. state or territory will receive an award to collaborate with relevant state partners in this effort." In March 2007, more than 226,000 pages of newspapers from California, Florida, Kentucky, New York, Utah, Virginia and the District of Columbia published between 1900 and 1910 were put online at a fully searchable site called "Chronicling America." As of December 2007, the total number of pages was about 413,000. The collection expanded further to 1 million pages in 2009. Funding through the National Endowment for the Humanities is carried out through its "We The People" initiative.

In library and archival science, digital preservation is a formal process to ensure that digital information of continuing value remains accessible and usable in the long term. It involves planning, resource allocation, and application of preservation methods and technologies, and combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and, possibly, enforcing processes for its creation, approval, and distribution. Systems using ECM generally provide a secure repository for managed items, analog or digital. They also include one or more methods for importing content to bring new items under management, and several presentation methods to make items available for use. ECM content may be protected by digital rights management (DRM), but this is not required. ECM is distinguished from general content management by its cognizance of the processes and procedures of the enterprise for which it is created.

Microform: Forms with microreproductions of documents

A microform is a scaled-down reproduction of a document, typically either photographic film or paper, made for the purposes of transmission, storage, reading, and printing. Microform images are commonly reduced to about 4% or 1/25 of the original document size. For special purposes, greater optical reductions may be used.

Google Books: Service from Google

Google Books is a service from Google that searches the full text of books and magazines that Google has scanned, converted to text using optical character recognition (OCR), and stored in its digital database. Books are provided either by publishers and authors through the Google Books Partner Program, or by Google's library partners through the Library Project. Additionally, Google has partnered with a number of magazine publishers to digitize their archives.

Intelligent character recognition (ICR) is used to extract handwritten text from images. It is a more sophisticated type of OCR technology that recognizes different handwriting styles and fonts to intelligently interpret data on forms and physical documents.

Information International, Inc., commonly referred to as Triple-I or III, was an early computer technology company.

Book scanning: Process of converting physical media into digital media

Book scanning or book digitization is the process of converting physical books and magazines into digital media such as images, electronic text, or electronic books (e-books) by using an image scanner. Large scale book scanning projects have made many books available online.

Preservation (library and archive): Set of activities aimed at prolonging the life of a record or object

In conservation, library and archival science, preservation is a set of preventive conservation activities aimed at prolonging the life of a record, book, or object while making as few changes as possible. Preservation activities vary widely and may include monitoring the condition of items, maintaining the temperature and humidity in collection storage areas, writing a plan in case of emergencies, digitizing items, writing relevant metadata, and increasing accessibility. Preservation, in this definition, is practiced in a library or an archive by a conservator, librarian, archivist, or other professional when they perceive a collection or record is in need of maintenance.

Digital library: Online database of digital objects stored in electronic media formats and accessible via computers

A digital library is an online database of digital objects that can include text, still images, audio, video, digital documents, or other digital media formats or a library accessible through the internet. Objects can consist of digitized content like print or photographs, as well as originally produced digital content like word processor files or social media posts. In addition to storing content, digital libraries provide means for organizing, searching, and retrieving the content contained in the collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals or organizations. The digital content may be stored locally, or accessed remotely via computer networks. These information retrieval systems are able to exchange information with each other through interoperability and sustainability.

Document Capture Software refers to applications that provide the ability and feature set to automate the process of scanning paper documents or importing electronic documents, often for the purposes of feeding advanced document classification and data collection processes. Most scanning hardware, both scanners and copiers, provides the basic ability to scan to any number of image file formats, including: PDF, TIFF, JPG, BMP, etc. This basic functionality is augmented by document capture software, which can add efficiency and standardization to the process.

Digital mailroom: Automation of incoming mail processes

Digital mailroom is the automation of incoming mail processes. Using document scanning and document capture technologies, companies can digitise incoming mail and automate the classification and distribution of mail within the organization. Both paper and electronic mail (email) can be managed through the same process allowing companies to standardize their internal mail distribution procedures and adhere to company compliance policies.

Eighteenth Century Collections Online (ECCO) is a digital collection of books published in Great Britain during the 18th century.

Heritage Microfilm, Inc.: Microfilm digitization business based in Cedar Rapids, Iowa

Heritage Microfilm, Inc. is a preservation microfilm and microfilm digitization business located in Cedar Rapids, Iowa.

Hill Museum & Manuscript Library: Museum and library in Collegeville, Minnesota

The Hill Museum & Manuscript Library (HMML) is a nonprofit organization that photographs, catalogs, and provides free access to collections of manuscripts located in libraries around the world.

Mass digitization is a term used to describe "large-scale digitization projects of varying scopes." Such projects include efforts to digitize physical books on a mass scale to make knowledge openly and publicly accessible. They are made possible by selecting cultural objects, preparing them, scanning them, and constructing the necessary digital infrastructures, including digital libraries. These projects are often piloted by cultural institutions and private bodies; however, individuals may also attempt to conduct a mass digitization effort. Mass digitization efforts occur quite often; millions of files are uploaded to large-scale public or private online archives every single day. This practice of moving the physical to the digital on a mass scale changes the way we interact with knowledge. The history of mass digitization can be traced as early as the mid-1800s with the advent of microfilm, and technical infrastructures such as the internet, data farms, and computer data storage make these efforts technologically possible. This seemingly simple process of digitizing physical knowledge, or even products, has vast implications that can be explored.

References

  1. "The 'State of the Art': A Comparative Analysis of Newspaper Digitization to Date" (PDF). Center for Research Libraries. 10 April 2015. Retrieved 22 April 2024.
  2. "Best Practices for Newspaper Digitization", chapter 4 in Best Practices for Creating Digital Collections. University of Illinois at Urbana-Champaign. 2010. Archived from the original on 23 May 2013.