Multiversion concurrency control

Last updated January 12, 2025

Multiversion concurrency control (MCC or MVCC), is a non-locking concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memory.^[1]

Description

Without concurrency control, if someone is reading from a database at the same time as someone else is writing to it, it is possible that the reader will see a half-written or inconsistent piece of data. For instance, when making a wire transfer between two bank accounts if a reader reads the balance at the bank when the money has been withdrawn from the original account and before it was deposited in the destination account, it would seem that money has disappeared from the bank. Isolation is the property that provides guarantees in the concurrent accesses to data. Isolation is implemented by means of a concurrency control protocol. The simplest way is to make all readers wait until the writer is done, which is known as a read-write lock. Locks are known to create contention especially between long read transactions and update transactions. MVCC aims at solving the problem by keeping multiple copies of each data item. In this way, each user connected to the database sees a snapshot of the database at a particular instant in time. Any changes made by a writer will not be seen by other users of the database until the changes have been completed (or, in database terms: until the transaction has been committed.)

When an MVCC database needs to update a piece of data, it will not overwrite the original data item with new data, but instead creates a newer version of the data item. Thus there are multiple versions stored. The version that each transaction sees depends on the isolation level implemented. The most common isolation level implemented with MVCC is snapshot isolation. With snapshot isolation, a transaction observes a state of the data as of when the transaction started.

MVCC provides point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. Read and write transactions are thus isolated from each other without any need for locking. However, despite locks being unnecessary, they are used by some MVCC databases such as Oracle. Writes create a newer version, while concurrent reads access an older version.

MVCC introduces the challenge of how to remove versions that become obsolete and will never be read. In some cases, a process to periodically sweep through and delete the obsolete versions is implemented. This is often a stop-the-world process that traverses a whole table and rewrites it with the last version of each data item. PostgreSQL can use this approach with its VACUUM FREEZE process. Other databases split the storage blocks into two parts: the data part and an undo log. The data part always keeps the last committed version. The undo log enables the recreation of older versions of data. The main inherent limitation of this latter approach is that when there are update-intensive workloads, the undo log part runs out of space and then transactions are aborted as they cannot be given their snapshot. For a document-oriented database it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk—when updated, the entire document can be re-written rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

Implementation

MVCC uses timestamps (TS), and incrementing transaction IDs, to achieve transactional consistency. MVCC ensures a transaction (T) never has to wait to Read a database object (P) by maintaining several versions of the object. Each version of object P has both a Read Timestamp (RTS) and a Write Timestamp (WTS) which lets a particular transaction T_i read the most recent version of the object which precedes the transaction's Read TimestampRTS(T_i).

If transaction T_i wants to Write to object P, and there is also another transaction T_k happening to the same object, the Read Timestamp RTS(T_i) must precede the Read Timestamp RTS(T_k), i.e., RTS(T_i) < RTS(T_k)^{[ clarification needed ]}, for the object Write Operation (WTS) to succeed. A Write cannot complete if there are other outstanding transactions with an earlier Read Timestamp (RTS) to the same object. Like standing in line at the store, you cannot complete your checkout transaction until those in front of you have completed theirs.

To restate; every object (P) has a Timestamp (TS), however if transaction T_i wants to Write to an object, and the transaction has a Timestamp (TS) that is earlier than the object's current Read Timestamp, TS(T_i) < RTS(P), then the transaction is aborted and restarted. (This is because a later transaction already depends on the old value.) Otherwise, T_i creates a new version of object P and sets the read/write timestamp TS of the new version to the timestamp of the transaction TS ← TS(T_i).^[2]

The drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

A structure to hold a record (row) for a database using MVCC could look like this in Rust.

structRecord{/// Insert transaction identifier stamp.insert_transaction_id: u32,/// Delete transaction identifier stamp.delete_transaction_id: u32,/// The length of the data.data_length: u16,/// The content of the record.data: Vec<u8>,}

Record
Offset	Octet	0								1								2								3
Octet	Bit	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
0	0	Insert transaction identifier
4	32	Delete transaction identifier
8	64	Data length
10	80	Data…
14	112
⋮	⋮

Insert transaction identifier: 32 bits:The MVCC transaction identifier for insert.
Delete transaction identifier: 32 bits:The MVCC transaction identifier for delete.
Data length: 16 bits:The length of the data.
Data: Variable:The content stored in the record.

Examples

Concurrent read–write

At Time = 1, the state of a database could be:

Time	Object 1	Object 2
0	"Foo" by T0	"Bar" by T0
1	"Hello" by T1

T0 wrote Object 1="Foo" and Object 2="Bar". After that T1 wrote Object 1="Hello" leaving Object 2 at its original value. The new value of Object 1 will supersede the value at 0 for all transactions that start after T1 commits at which point version 0 of Object 1 can be garbage collected.

If a long running transaction T2 starts a read operation of Object 2 and Object 1 after T1 committed and there is a concurrent update transaction T3 which deletes Object 2 and adds Object 3="Foo-Bar", the database state will look like this at time 2:

Time	Object 1	Object 2	Object 3
0	"Foo" by T0	"Bar" by T0
1	"Hello" by T1
2		(deleted) by T3	"Foo-Bar" by T3

There is a new version as of time 2 of Object 2 which is marked as deleted and a new Object 3. Since T2 and T3 run concurrently T2 sees the version of the database before 2 i.e. before T3 committed writes, as such T2 reads Object 2="Bar" and Object 1="Hello". This is how multiversion concurrency control allows snapshot isolation reads without any locks.

History

Multiversion concurrency control is described in some detail in the 1981 paper "Concurrency Control in Distributed Database Systems"^[3] by Phil Bernstein and Nathan Goodman, then employed by the Computer Corporation of America. Bernstein and Goodman's paper cites a 1978 dissertation^[4] by David P. Reed which quite clearly describes MVCC and claims it as an original work.

The first shipping, commercial database software product featuring MVCC was VAX Rdb/ELN, released in 1984,^[5] and created at Digital Equipment Corporation by Jim Starkey. Starkey went on^[6] to create the second commercially successful MVCC database - InterBase.^[7]

Related Research Articles

In computer science, ACID is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. In the context of databases, a sequence of database operations that satisfies the ACID properties is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.

In computer science, particularly the field of databases, the Thomas write rule is a rule in timestamp-based concurrency control. It can be summarized as ignore outdated writes.

In computer science, a timestamp-based concurrency control algorithm is a optimistic concurrency control method. It is used in some databases to safely handle transactions using timestamps.

In Computer Science, in the field of databases, non-lock concurrency control is a concurrency control method used in relational databases without using locking.

In information technology and computer science, especially in the fields of computer programming, operating systems, multiprocessors, and databases, concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible.

In computer science, in the field of databases, write–write conflict, also known as overwriting uncommitted data is a computational anomaly associated with interleaved execution of transactions. Specifically, a write–write conflict occurs when "transaction requests to write an entity for which an unclosed transaction has already made a write request."

In computer science, in the field of databases, read–write conflict, also known as unrepeatable reads, is a computational anomaly associated with interleaved execution of transactions. Specifically, a read–write conflict occurs when a "transaction requests to read an entity for which an unclosed transaction has already made a write request."

Optimistic concurrency control (OCC), also known as optimistic locking, is a non-locking concurrency control method applied to transactional systems such as relational database management systems and software transactional memory. OCC assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. If the check reveals conflicting modifications, the committing transaction rolls back and can be restarted. Optimistic concurrency control was first proposed in 1979 by H. T. Kung and John T. Robinson.

A database transaction symbolizes a unit of work, performed within a database management system against a database, that is treated in a coherent and reliable way independent of other transactions. A transaction generally represents any change in a database. Transactions in a database environment have two main purposes:

To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure. For example: when execution prematurely and unexpectedly stops in which case many operations upon a database remain uncompleted, with unclear status.
To provide isolation between programs accessing a database concurrently. If this isolation is not provided, the programs' outcomes are possibly erroneous.

In database systems, isolation is one of the ACID transaction properties. It determines how transaction integrity is visible to other users and systems. A lower isolation level increases the ability of many users to access the same data at the same time, but also increases the number of concurrency effects users might encounter. Conversely, a higher isolation level reduces the types of concurrency effects that users may encounter, but requires more system resources and increases the chances that one transaction will block another.

In the fields of databases and transaction processing, a schedule of a system is an abstract model to describe the order of executions in a set of transactions running in the system. Often it is a list of operations (actions) ordered by time, performed by a set of transactions that are executed together in the system. If the order in time between certain operations is not determined by the system, then a partial order is used. Examples of such operations are requesting a read operation, reading, writing, aborting, committing, requesting a lock, locking, etc. Often, only a subset of the transaction operation types are included in a schedule.

In database technologies, a rollback is an operation which returns the database to some previous state. Rollbacks are important for database integrity, because they mean that the database can be restored to a clean copy even after erroneous operations are performed. They are crucial for recovering from database server crashes; by rolling back any transaction which was active at the time of the crash, the database is restored to a consistent state.

<span class="mw-page-title-main">HSQLDB</span> Java-based database engine

HSQLDB is a relational database management system written in Java. It has a JDBC driver and supports a large subset of SQL-92, SQL:2008, SQL:2011, and SQL:2016 standards. It offers a fast, small database engine which offers both in-memory and disk-based tables. Both embedded and server modes are available.

In computer science, software transactional memory (STM) is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. It is an alternative to lock-based synchronization. STM is a strategy implemented in software, rather than as a hardware component. A transaction in this context occurs when a piece of code executes a series of reads and writes to shared memory. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions. The idea of providing hardware support for transactions originated in a 1986 paper by Tom Knight. The idea was popularized by Maurice Herlihy and J. Eliot B. Moss. In 1995, Nir Shavit and Dan Touitou extended this idea to software-only transactional memory (STM). Since 2005, STM has been the focus of intense research and support for practical implementations is growing.

The Zope Object Database (ZODB) is an object-oriented database for transparently and persistently storing Python objects. It is included as part of the Zope web application server, but can also be used independently of Zope.

Commitment ordering (CO) is a class of interoperable serializability techniques in concurrency control of databases, transaction processing, and related applications. It allows optimistic (non-blocking) implementations. With the proliferation of multi-core processors, CO has also been increasingly utilized in concurrent programming, transactional memory, and software transactional memory (STM) to achieve serializability optimistically. CO is also the name of the resulting transaction schedule (history) property, defined in 1988 with the name dynamic atomicity. In a CO compliant schedule, the chronological order of commitment events of transactions is compatible with the precedence order of the respective transactions. CO is a broad special case of conflict serializability and effective means to achieve global serializability across any collection of database systems that possibly use different concurrency control mechanisms.

In databases, and transaction processing, snapshot isolation is a guarantee that all reads made in a transaction will see a consistent snapshot of the database, and the transaction itself will successfully commit only if no updates it has made conflict with any concurrent updates made since that snapshot.

A precedence graph, also named conflict graph and serializability graph, is used in the context of concurrency control in databases. It is the directed graph representing precedence of transactions in the schedule, as reflected by precedence of conflicting operations in the transactions. A schedule is conflict-serializable if and only if its precedence graph of committed transactions is acyclic.

NoSQLz is a consistent key-value big data store for z/OS IBM systems. It was developed by systems programmer Thierry Falissard in 2013. The purpose was to provide a low-cost alternative to all proprietary mainframe DBMS.

References

↑ "Clojure - Refs and Transactions". clojure.org. Retrieved 2019-04-12.
↑ Ramakrishnan, R., & Gehrke, J. (2000). Database management systems. Osborne/McGraw-Hill.
↑ Bernstein, Philip A.; Goodman, Nathan (1981). "Concurrency Control in Distributed Database Systems". ACM Computing Surveys. 13 (2): 185–221. doi:10.1145/356842.356846.
↑ Reed, D.P. (1978). Naming and Synchronization in a Decentralized Computer System. MIT dissertation (Thesis). hdl:1721.1/16279 . Retrieved November 12, 2022.
↑ Gallant, John (9 April 1984). "RDB Gets Mixed Greeting". Computerworld. Retrieved 13 September 2021.
↑ "Re: [Firebird-devel] isc_dpb_dbkey_scope | Firebird".
↑ "A not-so-very technical discussion of Multi Version Concurrency Control". firebirdsql.org. Retrieved 2020-11-12.