Algorithms for Recovery and Isolation Exploiting Semantics

In computer science, Algorithms for Recovery and Isolation Exploiting Semantics, or ARIES, is a recovery algorithm designed to work with a no-force, steal database approach; it is used by IBM Db2, Microsoft SQL Server and many other database systems.[1] IBM Fellow Dr. C. Mohan is the primary inventor of the ARIES family of algorithms.[2]

Three main principles lie behind ARIES:

  - Write-ahead logging: Any change to an object is first recorded in the log, and the log must be written to stable storage before the changes to the object are written to disk.
  - Repeating history during Redo: On restart after a crash, ARIES retraces the actions of the database before the crash and brings the system back to the exact state that it was in before the crash. Then it undoes the transactions still active at crash time.
  - Logging changes during Undo: Changes made to the database while undoing transactions are logged to ensure such an action is not repeated in the event of repeated restarts.

Logging

The ARIES algorithm relies on logging of all database operations with ascending Sequence Numbers. Usually the resulting logfile is stored on so-called "stable storage", that is, a storage medium that is assumed to survive crashes and hardware failures.

To gather the necessary information for the logs, two data structures have to be maintained: the dirty page table (DPT) and the transaction table (TT).

The dirty page table keeps a record of all the pages that have been modified and not yet written to disk, together with the first Sequence Number that caused each page to become dirty. The transaction table contains all currently running transactions and the Sequence Number of the last log entry they created.

We create log records of the form (Sequence Number, Transaction ID, Page ID, Redo, Undo, Previous Sequence Number). The Redo and Undo fields keep information about the changes this log record saves and how to undo them. The Previous Sequence Number is a reference to the previous log record that was created for this transaction. In the case of an aborted transaction, it's possible to traverse the log file in reverse order using the Previous Sequence Numbers, undoing all actions taken within the specific transaction.
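
As an illustration, the two tables and a log record might be modeled as follows. This is a minimal sketch, and all names (LogRecord, dirty_page_table, records_of_transaction, and so on) are illustrative, not part of any ARIES specification.

    from collections import namedtuple

    # One log record, as described above. redo/undo describe the change and
    # how to revert it; prev_seq_no links to the transaction's previous record.
    LogRecord = namedtuple(
        "LogRecord", "seq_no txn_id page_id redo undo prev_seq_no")

    # Dirty page table: page_id -> first Sequence Number that dirtied the page.
    dirty_page_table = {}
    # Transaction table: txn_id -> Sequence Number of the last record created.
    transaction_table = {}
    # Stand-in for the logfile on stable storage: seq_no -> LogRecord.
    log = {}

    def records_of_transaction(txn_id):
        """Walk a transaction's records newest-first via prev_seq_no, as is
        done when undoing the actions of an aborted transaction."""
        seq_no = transaction_table.get(txn_id)
        while seq_no is not None:
            record = log[seq_no]
            yield record
            seq_no = record.prev_seq_no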

Every transaction implicitly begins with its first "Update"-type entry for the given Transaction ID, and is committed with an "End of Log" (EOL) entry for the transaction.

During a recovery, or while undoing the actions of an aborted transaction, a special kind of log record is written, the Compensation Log Record (CLR), to record that the action has already been undone. CLRs are of the form (Sequence Number, Transaction ID, Page ID, Redo, Previous Sequence Number, Next Undo Sequence Number). The Redo field contains the application of the Undo field of the reverted action, and the Undo field is omitted because a CLR is never itself reverted.
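
Continuing the sketch above, a CLR could be modeled like this; note that its redo field carries the undo information of the record being compensated, and there is no undo field. Names are again illustrative.

    from collections import namedtuple

    # Compensation log record: no undo field, since a CLR is never reverted.
    CLR = namedtuple(
        "CLR", "seq_no txn_id page_id redo prev_seq_no next_undo_seq_no")

    def compensate(record, clr_seq_no, txn_last_seq_no):
        """Build the CLR recording that `record` has been undone.
        next_undo_seq_no points at the next record still to be undone."""
        return CLR(seq_no=clr_seq_no,
                   txn_id=record.txn_id,
                   page_id=record.page_id,
                   redo=record.undo,             # applying the undo is the CLR's redo
                   prev_seq_no=txn_last_seq_no,  # the transaction's last log entry
                   next_undo_seq_no=record.prev_seq_no)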

Recovery

The recovery works in three phases. The first phase, Analysis, computes all the necessary information from the logfile. The Redo phase restores the database to the exact state at the crash, including all the changes of uncommitted transactions that were running at that point in time. The Undo phase then undoes all uncommitted changes, leaving the database in a consistent state.

Analysis

During the Analysis phase we restore the DPT and the TT as they were at the time of the crash.

We run through the logfile (from the beginning or the last checkpoint) and add all transactions for which we encounter Begin Transaction entries to the TT. Whenever an End of Log entry is found, the corresponding transaction is removed. The last Sequence Number for each transaction is of course also maintained.

During the same run we also fill the dirty page table by adding a new entry whenever we encounter a page that is modified and not yet in the DPT. This, however, only computes a superset of all pages dirty at the time of the crash, since we don't check the actual database file to see whether the page was written back to storage.
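
A sketch of this Analysis pass over a simplified log follows; the record types and field names here are assumptions made for illustration.

    def analysis(log_records):
        """Rebuild the transaction table (TT) and dirty page table (DPT) by
        scanning the log forward from the beginning or the last checkpoint.
        Each record is a dict with at least "type", "seq_no" and "txn_id"."""
        transaction_table = {}  # txn_id -> last seq_no of the transaction
        dirty_page_table = {}   # page_id -> seq_no that first dirtied the page
        for rec in log_records:  # assumed in ascending seq_no order
            if rec["type"] == "end":
                transaction_table.pop(rec["txn_id"], None)
                continue
            # Any other record advances the transaction's last Sequence Number.
            transaction_table[rec["txn_id"]] = rec["seq_no"]
            if rec["type"] in ("update", "clr"):
                # Only a superset of the truly dirty pages: we never consult
                # the database file to check whether the page reached disk.
                dirty_page_table.setdefault(rec["page_id"], rec["seq_no"])
        return transaction_table, dirty_page_table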

Redo

From the DPT, we can compute the minimal Sequence Number over all dirty pages. From there, we redo the logged actions up to the crash, in case they weren't persisted already.

Running through the log file, we check for each entry whether the modified page P of the entry exists in the DPT. If it doesn't, we do not have to worry about redoing this entry, since the data already persists on disk. If page P does exist in the DPT, we check whether the Sequence Number recorded in the DPT is smaller than or equal to the Sequence Number of the log record (i.e. whether the change in the log could be newer than the last version that was persisted). If it isn't, we don't redo the entry, since the change is already there. If it is, we fetch the page from database storage and compare the Sequence Number stored on the page with the Sequence Number of the log record. If the former is smaller than the latter, the change needs to be reapplied and the page written back to disk. This last check is necessary because the recovered DPT is only a conservative superset of the pages that really need changes reapplied.

Only if none of these checks rules the entry out do we reapply the redo action and store the new Sequence Number on the page. Storing the Sequence Number on the page is also important for recovery from a crash during the Redo phase, since it ensures that a redo is never applied twice to the same page.
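
This chain of checks could be sketched as follows; fetch_page and the page layout are assumptions, with pages taken to store the Sequence Number of the last change applied to them, and log_records assumed to start at the smallest Sequence Number in the DPT.

    def apply_change(page, redo_info):
        """Placeholder for reapplying the change described by redo_info."""
        page["data"] = redo_info

    def redo(log_records, dirty_page_table, fetch_page):
        """Reapply logged changes that may not have reached disk.
        CLRs are redone as well (see the Undo section below)."""
        for rec in log_records:
            if rec["type"] not in ("update", "clr"):
                continue
            first_dirty = dirty_page_table.get(rec["page_id"])
            if first_dirty is None or first_dirty > rec["seq_no"]:
                continue  # the change was persisted before the crash
            page = fetch_page(rec["page_id"])
            if page["seq_no"] >= rec["seq_no"]:
                continue  # the page on disk already contains this change
            apply_change(page, rec["redo"])
            page["seq_no"] = rec["seq_no"]  # makes the redo idempotent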

Undo

After the Redo phase, the database reflects the exact state at the crash. However, the changes of uncommitted transactions have to be undone to restore the database to a consistent state.

For that we run backwards through the log for each transaction in the TT (those runs can of course be combined into one), using the Previous Sequence Number fields in the records. For each record we undo the change (using the information in the Undo field) and write a compensation log record to the log file. If we encounter a Begin Transaction record, we write an End of Log record for that transaction.
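
A sketch of this Undo pass, combining the per-transaction runs into one loop and reusing the simplified record format from above; apply_undo and next_seq_no are illustrative stand-ins.

    def undo(log, transaction_table, next_seq_no, apply_undo):
        """Roll back every transaction left in the TT, walking its records
        backwards via prev_seq_no, writing a CLR for each undone change and
        an End of Log record once the transaction's start is reached."""
        new_records = []
        for txn_id, seq_no in transaction_table.items():
            while seq_no is not None:
                rec = log[seq_no]
                apply_undo(rec["page_id"], rec["undo"])
                new_records.append({
                    "type": "clr", "seq_no": next_seq_no, "txn_id": txn_id,
                    "page_id": rec["page_id"],
                    "redo": rec["undo"],  # the CLR's redo is the applied undo
                    "next_undo_seq_no": rec["prev_seq_no"],
                })
                next_seq_no += 1
                seq_no = rec["prev_seq_no"]
            new_records.append(
                {"type": "end", "seq_no": next_seq_no, "txn_id": txn_id})
            next_seq_no += 1
        return new_records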

The compensation log records make it possible to recover from a crash that occurs during the recovery phase itself. That isn't as uncommon as one might think, as the recovery phase can take quite long. CLRs are read during the Analysis phase and redone during the Redo phase.

Checkpoints

To avoid re-scanning the whole logfile during the Analysis phase, it is advisable to save the DPT and the TT to the logfile regularly, forming a checkpoint. Instead of having to run through the whole file, it is then only necessary to run backwards until a checkpoint is found. From that point it is possible to restore the DPT and the TT as they were at the time of the crash by reading the logfile forward again, and then to proceed as usual with Redo and Undo.

The naive way of checkpointing involves locking the whole database to avoid changes to the DPT and the TT during the creation of the checkpoint. Fuzzy logging circumvents that by writing two log records: one Fuzzy Log Starts Here record and, after the checkpoint data has been prepared, the actual checkpoint record. Between the two records other log records can be created. During recovery it is necessary to find both records to obtain a valid checkpoint.
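
A sketch of such a fuzzy checkpoint write; append_to_log and snapshot_tables are illustrative stand-ins for the logging and table-copying machinery.

    def write_fuzzy_checkpoint(append_to_log, snapshot_tables):
        """Write the begin record, gather the DPT and TT without blocking
        other log activity, then write the checkpoint itself. During recovery
        a checkpoint is only valid if both records are found."""
        append_to_log({"type": "fuzzy_log_starts_here"})
        dpt, tt = snapshot_tables()  # other records may interleave here
        append_to_log({"type": "checkpoint",
                       "dirty_page_table": dpt,
                       "transaction_table": tt})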

References

  1. Mohan, C.; Haderle, Donald; Lindsay, Bruce; Pirahesh, Hamid; Schwarz, Peter (March 1992). "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging". ACM Transactions on Database Systems. 17 (1): 94–162. doi:10.1145/128765.128770.
  2. Mohan, C. (September 1999). "Repeating History Beyond ARIES" (PDF). Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, UK.