Distributed lock manager

Operating systems use lock managers to organise and serialise access to resources. A distributed lock manager (DLM) runs on every machine in a cluster, with an identical copy of a cluster-wide lock database. In this way a DLM gives software applications that are distributed across multiple machines in a cluster a means to synchronize their accesses to shared resources.

DLMs have been used as the foundation for several successful clustered file systems, in which the machines in a cluster can use each other's storage via a unified file system, with significant advantages for performance and availability. The main performance benefit comes from solving the problem of disk cache coherency between participating computers. The DLM is used not only for file locking but also for coordination of all disk access. VMScluster, the first clustering system to come into widespread use, relied on the OpenVMS DLM in just this way.

Resources

The DLM uses a generalized concept of a resource, which is some entity to which shared access must be controlled. This can relate to a file, a record, an area of shared memory, or anything else that the application designer chooses. A hierarchy of resources may be defined, so that a number of levels of locking can be implemented. For instance, a hypothetical database might define a resource hierarchy as follows:

A process can then acquire locks on the database as a whole, and then on particular parts of the database. A lock must be obtained on a parent resource before a subordinate resource can be locked.
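The parent-before-child rule can be illustrated with a minimal sketch. It is not the OpenVMS interface: the `Resource` and `Process` classes and the three level names (`database`, `table`, `record`) are hypothetical, chosen only for illustration.

```python
class Resource:
    """A named node in a hypothetical resource hierarchy."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


class Process:
    """Tracks which resources this illustrative process has locked."""

    def __init__(self):
        self.locked = set()

    def lock(self, resource):
        # Enforce the rule from the text: the parent must be locked first.
        if resource.parent is not None and resource.parent not in self.locked:
            raise PermissionError(
                f"lock {resource.parent.name!r} before {resource.name!r}"
            )
        self.locked.add(resource)


# Hypothetical three-level hierarchy.
database = Resource("database")
table = Resource("table", parent=database)
record = Resource("record", parent=table)

proc = Process()
proc.lock(database)  # whole database first
proc.lock(table)     # then a part of it
proc.lock(record)    # then a part of that part
```

A process that tried to lock `record` without first locking `database` and `table` would get a `PermissionError` in this sketch.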

Lock modes

A process running within a VMScluster may obtain a lock on a resource. Six lock modes can be granted: null (NL), concurrent read (CR), concurrent write (CW), protected read (PR), protected write (PW), and exclusive (EX). The mode determines the level of exclusivity being granted. A lock can be converted to a higher or lower lock mode. When all processes have unlocked a resource, the system's information about the resource is destroyed.

The following truth table shows the compatibility of each lock mode with the others:

Mode  NL   CR   CW   PR   PW   EX
NL    Yes  Yes  Yes  Yes  Yes  Yes
CR    Yes  Yes  Yes  Yes  Yes  No
CW    Yes  Yes  Yes  No   No   No
PR    Yes  Yes  No   Yes  No   No
PW    Yes  Yes  No   No   No   No
EX    Yes  No   No   No   No   No
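The compatibility table can be encoded directly as data. The sketch below is illustrative only: the function names `compatible` and `can_grant` are assumptions and do not belong to any real DLM API.

```python
MODES = ["NL", "CR", "CW", "PR", "PW", "EX"]

# One row per held mode; columns follow the order in MODES.
# "Y" means the two modes can be held on the same resource at once.
_ROWS = {
    "NL": "YYYYYY",
    "CR": "YYYYYN",
    "CW": "YYYNNN",
    "PR": "YYNYNN",
    "PW": "YYNNNN",
    "EX": "YNNNNN",
}


def compatible(held, requested):
    """Return True if `requested` can coexist with an existing `held` lock."""
    return _ROWS[held][MODES.index(requested)] == "Y"


def can_grant(requested, held_modes):
    """Return True if `requested` is compatible with every lock already held."""
    return all(compatible(h, requested) for h in held_modes)
```

For example, `can_grant("PR", ["CR", "PR"])` is true because protected read is compatible with both modes, while `can_grant("PW", ["PR"])` is false.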

Obtaining a lock

A process can obtain a lock on a resource by enqueueing a lock request. This is similar to the QIO technique that is used to perform I/O. The enqueue lock request can either complete synchronously, in which case the process waits until the lock is granted, or asynchronously, in which case an AST occurs when the lock has been obtained.

It is also possible to establish a blocking AST, which is triggered when a process has obtained a lock that is preventing access to the resource by another process. The original process can then optionally take action to allow the other access (e.g. by demoting or releasing the lock).
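The interplay of a completion notification and a blocking notification can be sketched with plain callbacks. This is a toy, single-threaded model under stated assumptions: it handles only exclusive locks on one resource, and `ToyLock`, `enqueue`, `on_granted`, and `on_blocking` are hypothetical names, not the real enqueue service or its AST mechanism.

```python
from collections import deque


class ToyLock:
    """Exclusive-only lock with a FIFO queue of waiting requests."""

    def __init__(self):
        self.holder = None      # (owner, blocking_callback) or None
        self.waiters = deque()  # pending (owner, on_granted, on_blocking)

    def enqueue(self, owner, on_granted, on_blocking=None):
        """Queue a request; on_granted runs once the lock is obtained."""
        self.waiters.append((owner, on_granted, on_blocking))
        if self.holder is not None and self.holder[1] is not None:
            # Tell the current holder that it is blocking another request.
            self.holder[1](owner)
        self._grant_next()

    def release(self, owner):
        if self.holder is None or self.holder[0] != owner:
            raise RuntimeError(f"{owner} does not hold the lock")
        self.holder = None
        self._grant_next()

    def _grant_next(self):
        # Grant the lock to the oldest waiter if nobody holds it.
        if self.holder is None and self.waiters:
            owner, on_granted, on_blocking = self.waiters.popleft()
            self.holder = (owner, on_blocking)
            on_granted(owner)


events = []
lock = ToyLock()


def granted(owner):
    events.append(f"{owner} granted")


def a_is_blocking(waiter):
    events.append(f"A told it blocks {waiter}")
    lock.release("A")  # cooperate by releasing the lock


lock.enqueue("A", granted, on_blocking=a_is_blocking)
lock.enqueue("B", granted)
```

Here process A chooses to release as soon as it learns that it is blocking B. A holder could equally ignore the notification or convert to a weaker mode.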

Lock value block

A lock value block is associated with each resource. This can be read by any process that has obtained a lock on the resource (other than a null lock) and can be updated by a process that has obtained a protected update or exclusive lock on it.

It can be used to hold any information about the resource that the application designer chooses. A typical use is to hold a version number of the resource. Each time the associated entity (e.g. a database record) is updated, the holder of the lock increments the lock value block. When another process wishes to read the resource, it obtains the appropriate lock and compares the current lock value with the value it had last time the process locked the resource. If the value is the same, the process knows that the associated entity has not been updated since last time it read it, and therefore it is unnecessary to read it again. Hence, this technique can be used to implement various types of cache in a database or similar application.
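The version-number technique can be sketched as follows. The sketch leaves out the locking itself and assumes the caller already holds a suitable lock. `VersionedRecord` and `CachingReader` are hypothetical names, and the `reads` counter exists only to make the saved work visible.

```python
class VersionedRecord:
    """Shared entity plus its lock value block, used here as a version number."""

    def __init__(self, data):
        self.data = data
        self.value_block = 0
        self.reads = 0  # counts expensive reads, for illustration only

    def write(self, data):
        # Assumes the caller holds a lock strong enough to update the block.
        self.data = data
        self.value_block += 1

    def read(self):
        self.reads += 1
        return self.data


class CachingReader:
    """Re-reads the record only when the value block has changed."""

    def __init__(self, record):
        self.record = record
        self.cached = None
        self.seen_version = None

    def get(self):
        # Assumes the caller holds a lock that permits reading the value block.
        if self.seen_version != self.record.value_block:
            self.cached = self.record.read()
            self.seen_version = self.record.value_block
        return self.cached


record = VersionedRecord("v1")
reader = CachingReader(record)
first = reader.get()    # reads the record
second = reader.get()   # served from the cache, version unchanged
record.write("v2")      # bumps the value block
third = reader.get()    # version changed, so the record is read again
```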

Deadlock detection

When two or more processes have obtained locks on resources, it is possible to produce a situation where each is preventing another from obtaining a lock, and none of them can proceed. This is known as a deadlock (E. W. Dijkstra originally called it a deadly embrace). [1]

A simple example is when Process 1 has obtained an exclusive lock on Resource A, and Process 2 has obtained an exclusive lock on Resource B. If Process 1 then tries to lock Resource B, it will have to wait for Process 2 to release it. But if Process 2 then tries to lock Resource A, both processes will wait forever for each other.

The OpenVMS DLM periodically checks for deadlock situations. In the example above, the second lock enqueue request of one of the processes would return with a deadlock status. It would then be up to this process to take action to resolve the deadlock—in this case by releasing the first lock it obtained.
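The text does not say how the OpenVMS DLM performs its check. One common textbook approach, shown here purely as an assumption, is to look for a cycle in a wait-for graph. The sketch simplifies further by letting each blocked process wait on exactly one other process, and `find_deadlock` is a hypothetical name.

```python
def find_deadlock(waits_for):
    """Return a list of processes forming a cycle, or None.

    waits_for maps each blocked process to the process it is waiting on.
    """
    for start in waits_for:
        path = []
        seen = set()
        current = start
        # Follow the chain of waits until it ends or revisits a process.
        while current in waits_for and current not in seen:
            seen.add(current)
            path.append(current)
            current = waits_for[current]
        if current in seen:
            # The chain looped back: the cycle starts where `current` first appeared.
            return path[path.index(current):]
    return None


# The example from the text: Process 1 waits for Process 2 and vice versa.
cycle = find_deadlock({"P1": "P2", "P2": "P1"})
no_cycle = find_deadlock({"P1": "P2"})
```

Once a cycle is found, a lock manager can pick one process in it and fail its pending request, which corresponds to the deadlock status described above.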

Linux clustering

Both Red Hat and Oracle have developed clustering software for Linux.

OCFS2, the Oracle Cluster File System, was added [2] to the official Linux kernel with version 2.6.16, in January 2006. The alpha-quality code warning on OCFS2 was removed in 2.6.19.

Red Hat's cluster software, including their DLM and GFS2, was officially added to the Linux kernel [3] with version 2.6.19, in November 2006.

Both systems use a DLM modeled on the venerable VMS DLM. [4] Oracle's DLM has a simpler API: the core function, dlmlock(), has eight parameters, whereas the VMS SYS$ENQ service and Red Hat's dlm_lock both have 11.

Other implementations

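Other DLM implementations and lock services include Google's Chubby [5], Apache ZooKeeper [6], etcd [7][8], Redis [9][10], Consul [11], and Taooka [12].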

References

  1. Gehani, Narain (1991). Ada: Concurrent Programming. Silicon Press. p. 105. ISBN 9780929306087.
  2. kernel/git/torvalds/linux.git - Linux kernel source tree. Kernel.org. Retrieved on 2013-09-18.
  3. kernel/git/torvalds/linux.git - Linux kernel source tree. Archived 2012-07-18 at archive.today. Git.kernel.org (2006-12-07). Retrieved on 2013-09-18.
  4. The OCFS2 filesystem. Lwn.net (2005-05-24). Retrieved on 2013-09-18.
  5. Google Research Publication: Chubby Distributed Lock Service. Research.google.com. Retrieved on 2013-09-18.
  6. Zookeeper.apache.org. Retrieved on 2013-09-18.
  7. "CoreOS". coreos.com.
  8. etcd: Distributed reliable key-value store for the most critical data of a distributed system, CoreOS, 2018-01-16, retrieved 2016-09-20.
  9. redis.io. http://redis.io/. Retrieved 2015-04-14.
  10. "Distributed locks with Redis – Redis". redis.io. Retrieved 2015-04-14.
  11. Consul Overview. Retrieved on 2015-02-19.
  12. Taooka Description. Archived 2017-05-03 at the Wayback Machine. Retrieved on 2017-05-04.