Durability (database systems)

Last updated

In database systems, durability is the ACID property that guarantees that the effects of transactions that have been committed will survive permanently, even in case of failures, [1] including incidents and catastrophic events. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes. [2]

Contents

Formally, a database system ensures the durability property if it tolerates three types of failures: transaction, system, and media failures. [1] In particular, a transaction fails if its execution is interrupted before all its operations have been processed by the system. [3] These kinds of interruptions can be originated at the transaction level by data-entry errors, operator cancellation, timeout, or application-specific errors, like withdrawing money from a bank account with insufficient funds. [1] At the system level, a failure occurs if the contents of the volatile storage are lost, due, for instance, to system crashes, like out-of-memory events. [3] At the media level, where media means a stable storage that withstands system failures, failures happen when the stable storage, or part of it, is lost. [3] These cases are typically represented by disk failures. [1]

Thus, to be durable, the database system should implement strategies and operations that guarantee that the effects of transactions that have been committed before the failure will survive the event (even by reconstruction), while the changes of incomplete transactions, which have not been committed yet at the time of failure, will be reverted and will not affect the state of the database system. These behaviours are proven to be correct when the execution of transactions has respectively the resilience and recoverability properties. [3]

Mechanisms

A simplified finite state automaton showing possible DBMS after-failure (in red) states and the transitions (in black) that are necessary to return to a running system to achieve durability. Database failure states and relations.png
A simplified finite state automaton showing possible DBMS after-failure (in red) states and the transitions (in black) that are necessary to return to a running system to achieve durability.

In transaction-based systems, the mechanisms that assure durability are historically associated with the concept of reliability of systems, as proposed by Jim Gray in 1981. [1] This concept includes durability, but it also relies on aspects of the atomicity and consistency properties. [4] Specifically, a reliability mechanism requires primitives that explicitly state the beginning, the end, and the rollback of transactions, [1] which are also implied for the other two aforementioned properties. In this article, only the mechanisms strictly related to durability have been considered. These mechanisms are divided into three levels: transaction, system, and media level. This can be seen as well for scenarios where failures could happen and that have to be considered in the design of database systems to address durability. [3]

Transaction level

Durability against failures that occur at transaction level, such as canceled calls and inconsistent actions that may be blocked before committing by constraints and triggers, is guaranteed by the serializability property of the execution of transactions. The state generated by the effects of precedently committed transactions is available in main memory and, thus, is resilient, while the changes carried by non-committed transactions can be undone. In fact, thanks to serializability, they can be discerned from other transactions and, therefore, their changes are discarded. [3] In addition, it is relevant to consider that in-place changes, which overwrite old values without keeping any kind of history are discouraged. [1] There exist multiple approaches that keep track of the history of changes, such as timestamp-based solutions [5] or logging and locking. [1]

System level

At system level, failures happen, by definition, [3] when the contents of the volatile storage are lost. This can occur in events like system crashes or power outages. Existing database systems use volatile storage (i.e. the main memory of the system) for different purposes: some store their whole state and data in it, even without any durability guarantee; others keep the state and the data, or part of them, in memory, but also use the non-volatile storage for data; other systems only keep the state in main memory, while keeping all the data on disk. [6] The reason behind the choice of having volatile storage, which is subject to this type of failure, and non-volatile storage, is found in the performance differences of the existing technologies that are used to implement these kinds of storage. However, the situation is likely to evolve as the popularity of non-volatile memories (NVM) technologies grows. [7]

In systems that include non-volatile storage, durability can be achieved by keeping and flushing an immutable sequential log of the transactions to such non-volatile storage before acknowledging commitment. Thanks to their atomicity property, the transactions can be considered the unit of work in the recovery process that guarantees durability while exploiting the log. In particular, the logging mechanism is called write-ahead log (WAL) and allows durability by buffering changes to the disk before they are synchronized from the main memory. In this way, by reconstruction from the log file, all committed transactions are resilient to system-level failures, because they can be redone. Non-committed transactions, instead, are recoverable, since their operations are logged to non-volatile storage before they effectively modify the state of the database. [8] In this way, the partially executed operations can be undone without affecting the state of the system. After that, those transactions that were incomplete can be redone. Therefore, the transaction log from non-volatile storage can be reprocessed to recreate the system state right before any later system-level failure. It is worth mentioning that logging is done as a combination of tracking data and operations (i.e. transactions) for performance reasons. [9]

Media level

At media level, failure scenarios affect non-volatile storage, like hard disk drives, solid-state drives, and other types of storage hardware components. [8] To guarantee durability at this level, the database system shall rely on stable memory, which is a memory that is completely and ideally failure-resistant. This kind of memory can be achieved with mechanisms of replication and robust writing protocols. [4]

Many tools and technologies are available to provide a logical stable memory, such as the mirroring of disks, and their choice depends on the requirements of the specific applications. [4] In general, replication and redundancy strategies and architectures that behave like stable memory are available at different levels of the technology stack. In this way, even in case of catastrophic events where the storage hardware is damaged, data loss can be prevented. [10] At this level, there is a strong bond between durability and system and data recovery, in the sense that the main goal is to preserve the data, not necessarily in online replicas, but also as offline copies. [4] These last techniques fall into the categories of backup, data loss prevention, and disaster recovery. [11]

Therefore, in case of media failure, the durability of transactions is guaranteed by the ability to reconstruct the state of the database from the log files stored in the stable memory, in any way it was implemented in the database system. [8] There exist several mechanisms to store and reconstruct the state of a database system that improves the performance, both in terms of space and time, compared to managing all the log files created from the beginning of the database system. These mechanisms often include incremental dumping, differential files, and checkpoints. [12]

Distributed databases

In distributed transactions, ensuring durability requires additional mechanisms to preserve a consistent state sequence across all database nodes. This means, for example, that a single node may not be enough to decide to conclude a transaction by committing it. In fact, the resources used in that transaction may be on other nodes, where other transactions are occurring concurrently. Otherwise, in case of failure, if consistency could not be guaranteed, it would be impossible to acknowledge a safe state of the database for recovery. For this reason, all participating nodes must coordinate before a commit can be acknowledged. This is usually done by a two-phase commit protocol. [13]

In addition, in distributed databases, even the protocols for logging and recovery shall address the issues of distributed environments, such as deadlocks, that could prevent the resilience and recoverability of transactions and, thus, durability. [13] A widely adopted family of algorithms that ensures these properties is Algorithms for Recovery and Isolation Exploiting Semantics (ARIES). [8]

See also

Related Research Articles

RAID is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both. This is in contrast to the previous concept of highly reliable mainframe disk drives referred to as "single large expensive disk" (SLED).

In computer science, ACID is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. In the context of databases, a sequence of database operations that satisfies the ACID properties is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.

In information technology and computer science, especially in the fields of computer programming, operating systems, multiprocessors, and databases, concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible.

Multiversion concurrency control, is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memory.

A database transaction symbolizes a unit of work, performed within a database management system against a database, that is treated in a coherent and reliable way independent of other transactions. A transaction generally represents any change in a database. Transactions in a database environment have two main purposes:

  1. To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure. For example: when execution prematurely and unexpectedly stops in which case many operations upon a database remain uncompleted, with unclear status.
  2. To provide isolation between programs accessing a database concurrently. If this isolation is not provided, the programs' outcomes are possibly erroneous.

In computer science, Algorithms for Recovery and Isolation Exploiting Semantics, or ARIES is a recovery algorithm designed to work with a no-force, steal database approach; it is used by IBM Db2, Microsoft SQL Server and many other database systems. IBM Fellow Dr. C. Mohan is the primary inventor of the ARIES family of algorithms.

In the field of databases in computer science, a transaction log is a history of actions executed by a database management system used to guarantee ACID properties over crashes or hardware failures. Physically, a log is a file listing changes to the database, stored in a stable storage format.

In database systems, isolation determines how transaction integrity is visible to other users and systems.

In database systems, atomicity is one of the ACID transaction properties. An atomic transaction is an indivisible and irreducible series of database operations such that either all occurs, or nothing occurs. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright. As a consequence, the transaction cannot be observed to be in progress by another database client. At one moment in time, it has not yet happened, and at the next it has already occurred in whole.

Checkpointing is a technique that provides fault tolerance for computing systems. It basically consists of saving a snapshot of the application's state, so that applications can restart from that point in case of failure. This is particularly important for long running applications that are executed in failure-prone computing systems.

<span class="mw-page-title-main">HSQLDB</span> Java-based database engine

HSQLDB is a relational database management system written in Java. It has a JDBC driver and supports a large subset of SQL-92, SQL:2008, SQL:2011, and SQL:2016 standards. It offers a fast, small database engine which offers both in-memory and disk-based tables. Both embedded and server modes are available.

The Write Anywhere File Layout (WAFL) is a proprietary file system that supports large, high-performance RAID arrays, quick restarts without lengthy consistency checks in the event of a crash or power failure, and growing the filesystems size quickly. It was designed by NetApp for use in its storage appliances like NetApp FAS, AFF, Cloud Volumes ONTAP and ONTAP Select.

Extensible Storage Engine (ESE), also known as JET Blue, is an ISAM data storage technology from Microsoft. ESE is the core of Microsoft Exchange Server, Active Directory, and Windows Search. It's also used by a number of Windows components including Windows Update client and Help and Support Center. Its purpose is to allow applications to store and retrieve data via indexed and sequential access.

An in-memory database is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism. In-memory databases are faster than disk-optimized databases because disk access is slower than memory access and the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

In concurrency control of databases, transaction processing, and various transactional applications, both centralized and distributed, a transaction schedule is serializable if its outcome is equal to the outcome of its transactions executed serially, i.e. without overlapping in time. Transactions are normally executed concurrently, since this is the most efficient way. Serializability is the major correctness criterion for concurrent transactions' executions. It is considered the highest level of isolation between transactions, and plays an essential role in concurrency control. As such it is supported in all general purpose database systems. Strong strict two-phase locking (SS2PL) is a popular serializability mechanism utilized in most of the database systems since their early days in the 1970s.

Commitment ordering (CO) is a class of interoperable serializability techniques in concurrency control of databases, transaction processing, and related applications. It allows optimistic (non-blocking) implementations. With the proliferation of multi-core processors, CO has also been increasingly utilized in concurrent programming, transactional memory, and software transactional memory (STM) to achieve serializability optimistically. CO is also the name of the resulting transaction schedule (history) property, defined in 1988 with the name dynamic atomicity. In a CO compliant schedule, the chronological order of commitment events of transactions is compatible with the precedence order of the respective transactions. CO is a broad special case of conflict serializability and effective means to achieve global serializability across any collection of database systems that possibly use different concurrency control mechanisms.

sync is a standard system call in the Unix operating system, which commits all data from the kernel filesystem buffers to non-volatile storage, i.e., data which has been scheduled for writing via low-level I/O system calls. Higher-level I/O layers such as stdio may maintain separate buffers of their own.

<span class="mw-page-title-main">Log-structured merge-tree</span> Data structure

In computer science, the log-structured merge-tree is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees, like other search trees, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.

In computer science, persistent memory is any method or apparatus for efficiently storing data structures such that they can continue to be accessed using memory instructions or memory APIs even after the end of the process that created or last modified them.

References

  1. 1 2 3 4 5 6 7 8 Gray, Jim (1981). "The transaction concept: Virtues and limitations" (PDF). VLDB. 81: 144–154.
  2. "ACID Compliance: What It Means and Why You Should Care". MariaDB. 29 July 2018. Retrieved 22 September 2021.
  3. 1 2 3 4 5 6 7 Hadzilacos, Vassos (1988). "A theory of reliability in database systems". Journal of the ACM. 35 (1): 121–145. doi: 10.1145/42267.42272 . ISSN   0004-5411. S2CID   7052304.
  4. 1 2 3 4 Atzeni, Paolo, ed. (1999). Database systems: concepts, languages & architectures. New York: McGraw-Hill. pp. 311–320. ISBN   978-0-07-709500-0.
  5. Svobodova, L. (1980). "MANAGEMENT OF OBJECT HISTORIES IN THE SWALLOW REPOSITORY". Mit/LCS Tr-243. USA.
  6. Petrov, Oleksandr (2019). Database internals: a deep dive into how distributed data systems work (1st ed.). Beijing Boston Farnham Sebastopol Tokyo: O'Reilly. pp. 40–42. ISBN   978-1-4920-4034-7.
  7. Arulraj, Joy; Pavlo, Andrew (2017-05-09). "How to Build a Non-Volatile Memory Database Management System". Proceedings of the 2017 ACM International Conference on Management of Data. SIGMOD '17. New York, NY, USA: Association for Computing Machinery. pp. 1753–1758. doi:10.1145/3035918.3054780. ISBN   978-1-4503-4197-4. S2CID   648876.
  8. 1 2 3 4 Petrov, Oleksandr (2019). Database internals: a deep dive into how distributed data systems work (1st ed.). Beijing Boston Farnham Sebastopol Tokyo: O'Reilly. pp. 185–195. ISBN   978-1-4920-4034-7.
  9. Mohan, C.; Haderle, Don; Lindsay, Bruce; Pirahesh, Hamid; Schwarz, Peter (1992-03-01). "ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging". ACM Transactions on Database Systems. 17 (1): 94–162. doi: 10.1145/128765.128770 . ISSN   0362-5915. S2CID   8759704.
  10. Eich, Margaret H. (1987-02-01). "A classification and comparison of main memory database recovery techniques". 1987 IEEE Third International Conference on Data Engineering. IEEE. pp. 332–339. doi:10.1109/ICDE.1987.7272398. ISBN   978-0-8186-0762-2. S2CID   207773738.
  11. Choy, Manhoi; Leong, Hong Va; Wong, Man Hon (2000). "Disaster recovery techniques for database systems". Communications of the ACM. 43 (11es): 6. doi:10.1145/352515.352521. ISSN   0001-0782. S2CID   14781378.
  12. Verhofstad, Joost S. M. (1978-06-01). "Recovery Techniques for Database Systems". ACM Computing Surveys. 10 (2): 167–195. doi:10.1145/356725.356730. S2CID   8847522.
  13. 1 2 Mohan, C.; Haderle, Don; Lindsay, Bruce; Pirahesh, Hamid; Schwarz, Peter (1992-03-01). "ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging". ACM Transactions on Database Systems. 17 (1): 94–162. doi: 10.1145/128765.128770 . ISSN   0362-5915. S2CID   8759704.

Further reading