Database preservation usually involves converting the information stored in a database to a form likely to be accessible in the long term as technology changes, without losing the initial characteristics (context, content, structure, appearance and behaviour) of the data. [1]
With the prevalence of databases, different methods have been developed to aid in the preservation of databases and their contents. These methods vary depending on database characteristics and preservation needs. [2]
There are three basic methods of database preservation: migration, XML, and emulation. [1] Several tools, software products, and projects have also been created to aid in the preservation of databases, including SIARD, the Digital Preservation Toolkit, CHRONOS, and RODA.
The characteristics of the database itself are taken into consideration when preserving it. Relational databases are made up of tables that hold data in records; the tables connect to one another through common data points stored in those records. [3] With the emergence of big data, NoSQL databases are also coming into play. [4]
Databases are characterized as open or closed and as static or dynamic. An open database accepts additional data, whereas a closed database is complete and accepts no new data. A static database contains records that are not edited or changed after their initial inclusion, whereas a dynamic database contains records that may be edited in the future. Whether a database is open and static, open and dynamic, closed and static, or closed and dynamic affects the methods used for preservation. A dynamic database is more difficult to preserve than a static one because existing data keeps changing, and an open database is more difficult to preserve than a closed one because data keeps being added. The more often a database changes, whether by editing a record or by adding one, the more often steps must be taken to capture that change for preservation. [2]
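One way to decide when an open or dynamic database needs to be captured again is to compare a fingerprint of its content with the fingerprint recorded at the last capture. The following is a minimal sketch under stated assumptions, not a method taken from the cited sources: it uses Python's built-in sqlite3 module as a stand-in for any DBMS, and the function names `table_fingerprint` and `needs_new_snapshot` are hypothetical.

```python
import hashlib
import sqlite3


def table_fingerprint(conn: sqlite3.Connection, table: str) -> str:
    """Return a SHA-256 digest computed over every row of `table`."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"SELECT * FROM {quoted}").fetchall()
    digest = hashlib.sha256()
    # Sort rows by their textual form so physical row order does not matter.
    for row in sorted(rows, key=repr):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def needs_new_snapshot(conn: sqlite3.Connection, table: str,
                       last_fingerprint: str) -> bool:
    """True if the table content differs from the last captured state."""
    return table_fingerprint(conn, table) != last_fingerprint
```

In this sketch a closed, static database would keep the same fingerprint indefinitely, while adding a record (open) or editing one (dynamic) changes the fingerprint and signals that a new capture is due.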
Three core methods of digital preservation also apply to the preservation of databases: migration, XML, and emulation. [1]
The migration method (also known as inactive archiving) [3] involves transferring data from an obsolete database program to a newer format. There are three methods of migration: backward compatibility, interoperability, and conversion to standards. Backward compatibility involves utilizing newer software or hardware versions to open, access, and read a document which was made using an older version. Interoperability involves decreasing the possibility of obsolescence by ensuring a particular file can be accessed with more than one combination of software and hardware. Conversion to standards involves transferring data storage from a proprietary format to an open, more readily accessible, and widely used format. [1]
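Conversion to standards can be illustrated with a small export from a database engine's own storage into CSV, an open and widely readable text format. This is only a sketch under assumptions: sqlite3 stands in for the source DBMS, CSV is chosen as one example of an open target format, and the function name `export_table_to_csv` is hypothetical.

```python
import csv
import sqlite3
from typing import TextIO


def export_table_to_csv(conn: sqlite3.Connection, table: str, out: TextIO) -> None:
    """Write one table, with a header row of column names, to `out` as CSV."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out)
    # cursor.description holds one 7-tuple per column; item 0 is the name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
```

The resulting file can be opened without the original database software, although column types, keys, and relationships between tables are not carried over by this simple export.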
The XML method (also known as XML normalization) [3] involves converting original database information to the XML standard format. XML does not require particular hardware or software (beyond a text editor or word processor) and is both human and machine readable, making it a sustainable format for preservation and storage purposes. [1] However, in converting data to XML format, certain interactive functionality of the database, such as the ability to query, is lost. [3]
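A minimal sketch of XML normalization for a single table is given here, assuming sqlite3 as the source DBMS. The element and attribute names (`table`, `row`, `column`, `name`, `null`) are invented for illustration and do not follow DBML, SIARD, or any other published schema.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str) -> ET.Element:
    """Convert one table into an XML element tree (illustrative schema only)."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    columns = [column[0] for column in cursor.description]
    root = ET.Element("table", name=table)
    for row in cursor:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            cell = ET.SubElement(row_el, "column", name=column)
            if value is None:
                # Mark SQL NULL explicitly so it differs from an empty string.
                cell.set("null", "true")
            else:
                cell.text = str(value)
    return root
```

Calling `ET.tostring(table_to_xml(conn, "people"), encoding="unicode")` yields text that any editor can display. The output is a static document: SQL queries can no longer be run against it, which reflects the loss of interactive functionality noted above.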
The emulation method involves recreating an older computing environment with newer technologies and software. This allows obsolete software, hardware, or file formats to remain accessible on new systems. Therefore, an outdated database could be run on an emulator which mimics the environment that the database was originally created in. [1]
Version 1.0 of the Software Independent Archiving of Relational Databases (SIARD) format was developed by the Swiss Federal Archives in 2007. It was designed for archiving relational databases in a vendor-neutral form. A SIARD archive is a ZIP-based package of files based on XML and SQL:1999. A SIARD file incorporates both the database content and also machine-processable structural metadata that records the structure of database tables and their relationships. The ZIP file contains an XML file describing the database structure (metadata.xml) as well as a collection of XML files, one per table, capturing the table content. The SIARD archive may also contain text files and binary files representing database large objects (BLOBs and CLOBs). SIARD permits direct access to individual tables by exploring with ZIP tools. A SIARD archive is not an operational database but supports re-integration of the archived database into another relational database management system (RDBMS) that supports SQL:1999. In addition, SIARD supports the addition of descriptive and contextual metadata that is not recorded in the database itself and the embedding of documentation files in the archive. [5] SIARD Version 1.0 was homologized as standard eCH-0165 in 2013. [6]
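The general shape of such a package (one ZIP file holding a structural metadata file plus one XML file per table) can be sketched as follows. This is a simplified illustration and does not produce a valid SIARD file: the entry names `header/metadata.xml` and `content/<table>.xml`, the XML element names, and the function name `build_archive` are assumptions made for the example, whereas the real format defines its own folder layout and XML schemas.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET
import zipfile


def build_archive(conn: sqlite3.Connection) -> bytes:
    """Package every table of a SQLite database into a SIARD-like ZIP."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    metadata = ET.Element("database")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for table in tables:
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = conn.execute(f"SELECT * FROM {quoted}")
            columns = [column[0] for column in cursor.description]
            # Structural metadata: record the table and its column names.
            table_meta = ET.SubElement(metadata, "table", name=table)
            for column in columns:
                ET.SubElement(table_meta, "column", name=column)
            # Table content: one <row> per record, one <column> per value.
            content = ET.Element("table", name=table)
            for row in cursor:
                row_el = ET.SubElement(content, "row")
                for column, value in zip(columns, row):
                    cell = ET.SubElement(row_el, "column", name=column)
                    if value is not None:
                        cell.text = str(value)
            archive.writestr(f"content/{table}.xml",
                             ET.tostring(content, encoding="utf-8"))
        archive.writestr("header/metadata.xml",
                         ET.tostring(metadata, encoding="utf-8"))
    return buffer.getvalue()
```

Because each table is a separate entry, a single table can be read with any ZIP tool, for example `zipfile.ZipFile(io.BytesIO(data)).read("content/people.xml")`, without unpacking or loading the rest of the archive.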
Version 2.0 of the SIARD preservation format was designed and developed by the Swiss Federal Archives under the auspices of the E-ARK project. [7] Version 2.0 is based on version 1.0 and defines a format that is backwards-compatible with version 1.0. New features in version 2.0 include:
An XML schema was created by researcher José Carlos Ramalho from the University of Minho to capture table information and data from a relational database. It was published in 2007. [8]
CHRONOS is a software product which serves as a database preservation tool. [4] CSP Chronos Archiving represents one proprietary solution for database preservation. CHRONOS was developed from 2004 to 2006 by CSP in partnership with the University of Applied Sciences Landshut's department of computer science. [4] [9] CHRONOS pulls data from a database management system and stores it in a CHRONOS archive as text or XML files. Because the data is in plain text format, all of it can be accessed and read without a database management system (DBMS) or CHRONOS itself. This eliminates the need to maintain a DBMS solely for reading preserved static databases, as well as the need to migrate database files to newer database formats, which carries risk. [9] Although CHRONOS stores data in plain text format, its querying capabilities are considered comparable to those of a relational database. [4]
The Database Preservation Toolkit, or dbtoolkit, was created by the RODA project as a series of steps for ingesting and preserving relational databases in a normalized format; it is an instrument designed for the preservation of, and access to, archived databases. To normalize a relational database, the toolkit converts its data to DBML (Database Markup Language) or SIARD. Both are based on XML, a standard format that requires no specific or proprietary software or hardware, which makes it well suited to preservation. [10]
The Database Preservation Toolkit (DBPTK) allows conversion between database formats, including connection to live systems, for purposes of digitally preserving databases. The toolkit allows conversion of live or backed-up databases into preservation formats such as SIARD, an XML-based format created for the purpose of database preservation. In this conversion process the toolkit extracts DBMS-specific information using DBMS-specific connectors. Each connector pairs with a particular DBMS, extracts its data, and represents it in XML form, which is then expressed in DBML or SIARD. New connectors can also be created to ingest additional DBMSs. [10] The toolkit also allows conversion of the preservation formats back into live systems to restore the full functionality of databases. For example, it supports a specialized export into MySQL, optimized for PhpMyAdmin, so that the database can be fully explored through a web interface.
This toolkit was originally part of the RODA project [11] and then released on its own. It has been further developed in the E-ARK project together with a new version of the SIARD preservation format.
The toolkit uses input and output modules. Each module supports read and/or write to a particular database format or live system. New modules can easily be added by implementation of a new interface and adding new drivers. [12]
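The input/output module design can be sketched as a pair of small interfaces joined by a conversion function. This is a conceptual illustration only: the names `ImportModule`, `ExportModule`, `SQLiteImport`, `InMemoryExport`, and `convert` are hypothetical and are not the toolkit's actual API.

```python
import sqlite3
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple

Row = Tuple[object, ...]


class ImportModule(Protocol):
    """Reads structure and content from one database format or live system."""
    def tables(self) -> List[str]: ...
    def columns(self, table: str) -> List[str]: ...
    def rows(self, table: str) -> Iterator[Row]: ...


class ExportModule(Protocol):
    """Writes structure and content to one database format or live system."""
    def write_table(self, table: str, columns: List[str],
                    rows: Iterable[Row]) -> None: ...


class SQLiteImport:
    """Example input module for a live SQLite connection."""
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def _quoted(self, table: str) -> str:
        return '"' + table.replace('"', '""') + '"'

    def tables(self) -> List[str]:
        query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        return [row[0] for row in self.conn.execute(query)]

    def columns(self, table: str) -> List[str]:
        # LIMIT 0 returns no rows but still exposes the column description.
        cursor = self.conn.execute(f"SELECT * FROM {self._quoted(table)} LIMIT 0")
        return [column[0] for column in cursor.description]

    def rows(self, table: str) -> Iterator[Row]:
        return iter(self.conn.execute(f"SELECT * FROM {self._quoted(table)}"))


class InMemoryExport:
    """Example output module that simply collects tables in a dictionary."""
    def __init__(self) -> None:
        self.data: Dict[str, Tuple[List[str], List[Row]]] = {}

    def write_table(self, table: str, columns: List[str],
                    rows: Iterable[Row]) -> None:
        self.data[table] = (columns, list(rows))


def convert(source: ImportModule, target: ExportModule) -> None:
    """Stream every table from the input module to the output module."""
    for table in source.tables():
        target.write_table(table, source.columns(table), source.rows(table))
```

With this separation, supporting a new format or DBMS means writing one more class that satisfies the relevant interface; `convert` itself does not change.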
Research projects in this regard include:
RODA, or the Repository of Authentic Digital Objects, was a project launched in Portugal in 2006 by the Portuguese National Archives to preserve the digital objects produced by Portugal’s government institutions. The project aimed to combine several types of digital objects, including relational databases, into one repository. As a single repository for many differing types of digital objects, RODA aims to normalize all ingested objects, that is, to minimize the number of formats used to store documents and to preserve like documents in like formats. [10]
The RODA project emphasized the creation of a standardized method for preserving databases as digital objects. Database preservation poses a unique challenge in that the preservation process is split into three layers: data, structure (logic), and semantics (interface). [17] That is, a database's data, as well as its structure and semantics, needs to be preserved. To preserve all three of these elements, the RODA project developed the Database Preservation Toolkit. [10]