Data preservation

Data preservation is the act of conserving and maintaining both the safety and integrity of data. Preservation is done through formal activities that are governed by policies, regulations and strategies directed towards protecting and prolonging the existence and authenticity of data and its metadata. [1] Data can be described as the elements or units from which knowledge and information are created, [2] and metadata are summarizing subsets of those elements; in other words, data about the data. [3] The main goal of data preservation is to protect data from being lost or destroyed and to contribute to the reuse and progression of the data.

History

Most historical data collected over time has been lost or destroyed. War and natural disasters, combined with a lack of the materials and practices needed to preserve and protect data, have caused this. Usually, only the most important data sets were saved, such as government records and statistics, legal contracts and economic transactions. Data from scientific research and doctoral theses has mostly been destroyed through improper storage and a lack of awareness and execution of data preservation. [4] Over time, data preservation has evolved and gained importance and awareness. There are now many different ways to preserve data and many important organizations involved in doing so.

The first digital data storage solutions appeared in the 1950s and were usually flat or hierarchically structured. [5] While these solutions still had issues, they made storing data much cheaper and more easily accessible. In the 1970s, relational databases and spreadsheets appeared. Relational databases structure data into tables and are accessed using structured query languages, which made them more efficient than the preceding storage solutions; spreadsheets hold high volumes of numeric data, which can be applied to relational databases to produce derivative data. More recently, non-relational (non-structured query language) databases, which hold high volumes of unstructured or semi-structured data, have appeared as complements to relational databases. [4]
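The contrast between relational and non-relational storage can be illustrated with a minimal sketch. It assumes Python's standard `sqlite3` and `json` modules as stand-ins for a relational database and a document store; the table name `readings`, its columns, and the sample records are hypothetical and are not drawn from the sources cited here.

```python
import json
import sqlite3

# Relational: every row follows one fixed schema and is queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, year INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("A", 1990, 2.0), ("A", 1991, 4.0), ("B", 1990, 10.0)],
)

# Derivative data: a per-site average computed from the stored rows.
averages = dict(
    conn.execute("SELECT site, AVG(value) FROM readings GROUP BY site")
)

# Non-relational (document style): records need not share the same fields.
documents = [
    {"site": "A", "notes": "hand-written log", "tags": ["field", "scan"]},
    {"site": "B", "photo": {"format": "TIFF", "pages": 3}},
]
serialized = json.dumps(documents)
```

Here `averages` holds one derived value per site, while the two documents carry different fields without any schema change, which is the sense in which such stores suit semi-structured data.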

Importance

The scope of data preservation is vast. Essentially everything, from governmental and business records to art, can be represented as data and is liable to be lost. Such loss in turn means a loss of human history in perpetuity.

Data can be lost on a small or individual scale, whether as personal data loss or as data loss within businesses and organizations, as well as on a national or global scale, where it can negatively and potentially permanently affect areas such as environmental protection, medical research, homeland security, public health and safety, economic development [6] and culture. The mechanisms of data loss are as numerous as they are varied, ranging from disasters, wars, data breaches and negligence through simple forgetting to natural decay.

The use of data collections that are preserved and stored properly can be seen in the U.S. Geological Survey, which stores data collections on natural hazards, natural resources, and landscapes. The data collected by the Survey is used by federal and state land management agencies for land use planning and management, work that continually requires access to historical reference data. [6]

In contrast, data holdings are collections of gathered data that are informally kept, and not necessarily prepared for long-term preservation, for example a collection or back-up of personal files. Data holdings are generally the storage method that was in use in past cases where data was lost to environmental and other historical disasters. [4]

Furthermore, data retention differs from data preservation in the sense that, by definition, to retain an object (data) is to hold or keep possession or use of the object. [7] To preserve an object is to protect, maintain and keep up for future use. [8] Retention policies often center on when data should be deliberately deleted or withheld from public access, while preservation prioritizes permanence and more widely shared access.

Thus, data preservation exceeds the concept of having or possessing data or back up copies of data. Data preservation ensures reliable access to data by including back-up and recovery mechanisms that precede the event of a disaster or technological change. [9]
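One way to picture the difference between merely holding a back-up copy and preserving data is a checksum-based integrity check paired with a recovery step. The following is a minimal sketch, assuming SHA-256 checksums and a single local back-up copy; the function names `preserve` and `verify_and_recover` and the record layout are hypothetical and are not taken from the cited sources.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve(original: Path, backup_dir: Path) -> dict:
    """Copy a file to a back-up location and record its checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / original.name
    shutil.copy2(original, backup)
    # Hypothetical record layout: paths plus the checksum taken at copy time.
    return {"original": original, "backup": backup, "sha256": sha256_of(original)}


def verify_and_recover(record: dict) -> str:
    """Check the working copy against its checksum; restore it if damaged."""
    original = record["original"]
    if original.exists() and sha256_of(original) == record["sha256"]:
        return "intact"
    if sha256_of(record["backup"]) != record["sha256"]:
        return "unrecoverable"  # the back-up itself no longer matches
    shutil.copy2(record["backup"], original)
    return "restored"
```

The checksum is recorded before any damage occurs, so the recovery mechanism is in place ahead of the event it guards against. A real preservation system would typically keep several copies in separate locations and re-verify them on a schedule.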

Methods

Digital

Digital preservation is similar to data preservation, but is concerned mainly with technological threats and solely with digital data. Essentially, digital preservation is a set of formal activities that enable ongoing or persistent use of and access to digital data beyond the occurrence of technological malfunction or change. [10] Digital preservation anticipates the inevitable change in technology and protocols, and prepares for data to remain accessible across new types of technologies and platforms while the integrity of the data and metadata is conserved. [4]

Technology, while providing great progress in conserving data that may not have been possible in the past, is also changing at such a quick rate that digital data may become inaccessible because its format is incompatible with new software. Without the use of data preservation, much of our existing digital data is at risk. [9]
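Format migration is one commonly used response to this kind of obsolescence: data is rewritten into an openly documented format, and a record of the conversion is kept as metadata. The sketch below assumes a legacy fixed-width text layout converted to CSV; the function `migrate_fixed_width`, the column positions, and the provenance fields are all hypothetical illustrations rather than a method described by the cited sources.

```python
import csv
import hashlib
import io


def migrate_fixed_width(legacy_text, fields):
    """Convert fixed-width lines to CSV text plus a provenance record.

    `fields` is a list of (name, start, end) column slices (an assumed layout).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _, _ in fields])
    for line in legacy_text.splitlines():
        writer.writerow([line[start:end].strip() for _, start, end in fields])
    csv_text = out.getvalue()

    # Checksums of both versions let a later reader confirm which source
    # produced which migrated file.
    provenance = {
        "source_format": "fixed-width text",
        "target_format": "text/csv",
        "source_sha256": hashlib.sha256(legacy_text.encode("utf-8")).hexdigest(),
        "target_sha256": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
    }
    return csv_text, provenance


legacy = "A   1990  2.0\nB   1991 10.5\n"
layout = [("site", 0, 4), ("year", 4, 8), ("value", 8, 13)]
csv_text, provenance = migrate_fixed_width(legacy, layout)
```

Keeping the provenance record alongside the migrated file is what allows the integrity of both the data and its metadata to be checked after the change of format.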

The majority of methods used towards data preservation today are digital methods, which are so far the most effective methods that exist.

Archives

Archives are collections of historical documents and records. Archives contribute to the preservation of data by collecting data that is well organized, while providing the appropriate metadata to confirm it. [11]

An example of an important data archive is the LONI Image and Data Archive, which collects data regarding clinical trials and clinical research studies. [12]

Catalogues, directories and portals

Catalogues, directories and portals are consolidated resources which are kept by individual institutions, and are associated with data archives and holdings. [4] In other words, these resources do not present the data themselves, but may instead act as metadata aggregators and may maintain thorough inventories. [13]

Repositories

Repositories are places where data archives and holdings can be accessed and stored. The goal of repositories is to make sure that all requirements and protocols of archives and holdings are being met, and data is being certified to ensure data integrity and user trust. [4]

Single-site Repositories

A repository that holds all data sets on a single site. [4]

An example of a major single-site repository is Data Archiving and Networked Services (DANS), which provides ongoing access to digital research resources for the Netherlands. [14]

Multi-Site Repositories

A repository that hosts data sets on multiple institutional sites. [4]

An example of a well-known multi-site repository is OpenAIRE, which hosts research data and publications through collaboration among all of the EU countries and more. OpenAIRE promotes open scholarship and seeks to improve the discoverability and reusability of data. [15]

Trusted Digital Repository

A repository that seeks to provide reliable, trusted access over a long period of time. The repository can be single-site or multi-site but must comply with the Reference Model for an Open Archival Information System, [16] as well as adhere to a set of rules or attributes that contribute to its trust, such as persistent financial responsibility, organizational buoyancy, administrative responsibility, security and safety. [4]

An example of a trusted digital repository is the Digital Repository of Ireland (DRI), a multi-site repository that hosts Ireland's humanities and social science data sets. [17]

Cyber Infrastructures

Cyber infrastructures consist of archive collections which are made available through a system of hardware, technologies, software, policies, services and tools. Cyber infrastructures are geared towards the sharing of data, supporting peer-to-peer collaborations and a cultural community. [3]

An example of a major cyber infrastructure is the Canadian Geospatial Data Infrastructure, which provides access to spatial data in Canada. [18]

References

  1. "Dictionary Definitions". InterPARES 2 Terminology Database. InterPARES2. 2013. Retrieved 21 October 2013.
  2. Kitchin, R (2012). "Conceptualizing Data". The Data Revolution. London: Sage: 1–26.
  3. Cyberinfrastructure Council (2007). "Cyberinfrastructure vision for 21st century discovery" (PDF). Washington DC: National Science Foundation.
  4. Kitchin, R (2012). "Small Data, Data Infrastructures and Data Brokers". The Data Revolution. London: Sage: 27–47.
  5. Driscoll, K (2012). "From punched cards to "big data": a social history of database populism". Communication +1. 1 (4). Retrieved 22 February 2013.
  6. Pierce, F.; Steinmetz, J.; Dickinson, T.; McHugh, J. (2010). "The importance of data preservation". The Geological Society of America. Archived from the original on 2017-12-01. Retrieved 2017-11-29.
  7. "Retain" [Definition]. Merriam-Webster. 2017. Retrieved from: https://www.merriam-webster.com/dictionary/retain
  8. "Preserve" [Definition]. Merriam-Webster. 2017. Retrieved from: https://www.merriam-webster.com/dictionary/preserve
  9. Corrado, E.; Sandy, M. (2014). "Digital Preservation for Libraries, Archives, and Museums". Chapter 1. Rowman & Littlefield Publishers: 3–16.
  10. "Data Preservation". International Federation of Data Organizations for Social Science. 2012. Archived from the original on 2017-12-01. Retrieved 2017-11-28.
  11. Lauriault, T. P.; Hackett, Y; Kennedy, E (2013). Geo-spatial Data Preservation Primer. Ottawa: Hickling, Aurthurs and Low.
  12. "About Us". LONI Image and Data Archive. 2017.
  13. O'Carroll, A.; Collins, S.; Gallgher, D.; Tang, J.; Webb, S (2013). Caring for the Digital Content, Mapping International Approaches. Dublin: NUI Maynooth, Trinity College Dublin, Royal Irish Academy and Digital Repository of Ireland.
  14. "About DANS". Data Archiving and Networked Services. 2016.
  15. "Project Factsheets". OpenAIRE. 2017.
  16. "The OAIS reference model". www.oclc.org. Archived from the original on 2013-12-13.
  17. "About DRI". Digital Repository of Ireland. 2014–2015.
  18. "Canada's Spatial Data Infrastructure". Government of Canada. 2017.