Release consistency

Last updated July 27, 2023

Release consistency is one of the synchronization-based consistency models used in concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Introduction
Release consistency vs. sequential consistency
Hardware structure and program-level effort
Conditions for sequential consistent result
Implementations
Lock release
Post-wait
Lazy release consistency
Example
Performance improvement over release consistency
Drawback
Release consistency vs. other relaxed consistency models
Weak ordering (Weak consistency)
Processor consistency
See also
References

Introduction

In modern parallel computing systems, memory consistency must be maintained to avoid undesirable outcomes. Strict consistency models like sequential consistency are intuitively composed but can be quite restrictive in terms of performance as they would disable instruction level parallelism which is widely applied in sequential programming. To achieve better performance, some relaxed models are explored and release consistency is an aggressive relaxing attempt.^[1]

Release consistency vs. sequential consistency

Hardware structure and program-level effort

Sequential consistency can be achieved simply by hardware implementation, while release consistency is also based on an observation that most of the parallel programs are properly synchronized. In programming level, synchronization is applied to clearly schedule a certain memory access in one thread to occur after another. When a synchronized variable is accessed, hardware would make sure that all writes local to a processor have been propagated to all other processors and all writes from other processors are seen and gathered. In release consistency model, the action of entering and leaving a critical section are classified as acquire and release and for either case, explicit code should be put in the program showing when to do these operations.

Conditions for sequential consistent result

In general, a distributed shared memory is release consistent if it obeys the following rules:^[2]

1. Before an access to a shared variable is performed, all previous acquires by this processor must have completed.

2. Before a release is performed, all previous reads and writes by this process must have completed.

3. The acquire and release accesses must be processor consistent.

If the conditions above are met and the program is properly synchronized (i.e., processors implement acquire and release properly), the results of any execution will be exactly the same as they would have been executed following sequential consistency. In effect, accesses to shared variables are separated into atomic operation blocks by the acquire and release primitives so that races and interleaving between blocks will be prevented.

Implementations

Lock release

A lock release can be considered as a type of release synchronization. Assume a loop operation is performed using the code shown to the right. Two threads intend to enter a critical section and read the most recent value of a, then exit the critical section. The code shows that thread 0 first acquires the lock and enters the critical section. In order to execute correctly, P1 must read the latest value of a written by P0. In that case, only one thread can be in the critical section at a time. Therefore, the synchronization itself has ensured that the successful lock acquisition at P1 occurs after lock release by P0. Besides, the S2 -> S3 ordering has to be ensured, since the P0 must propagate the new value of a to P1. For the same reason, S5 must occur after S4.^[3]

Correctness is not affected if memory accesses following the unlock issue before the unlock complete or memory accesses prior to a lock issue after the lock acquisition. However, the code in critical section can not be issued prior to the lock acquisition is complete because mutual exclusion may not be guaranteed.

Post-wait

Post-wait synchronization is another implementation form of release consistency. As shown in the code to the right, correctness can be ensured if post operations occur only after all memory access are complete, especially the store to ‘a’. Apart from that, read operation should not be executed until the wait operation has completed. S2 acts as a release synchronization and S3 acts as an acquire synchronization. Therefore, S2 needs to prevent previous execution from occurring after it, and S3 needs to prevent any later execution from occurring before it. S2 does not need to prevent later execution from occurring before it, Likewise, S3 does not need to prevent any previous execution from occurring after it.

Lazy release consistency

Lazy release consistency is a further optimization of release consistency. It assumes that the thread executing an acquire access does not need the values written by other threads until the acquire access has completed. Hence, all behavior of coherence can be delayed and timing for write propagation can be tweaked.^[4]

Example

Consider the scenarios described in the image to the right. This case shows when write propagation is performed on a cache-coherent system based on the release consistency model. The variable datum is completely propagated before the propagation of datumIsReady. But the value of datum is not needed until after the acquire synchronization access in P1 and it can be propagated along with datumIsReady without harming the result of the program.

The second image displays what is the case when lazy release consistency is applied. Considering this scenario, all values written ahead of the release synchronization are delayed and propagated together with the propagation of the release access itself. Hence, datum and datumIsReady are propagated together at the release point.

"TreadMarks"^[5] is an actual application of lazy release consistency.

Performance improvement over release consistency

Lazy release consistency can outperform release consistency in certain cases. If there is a system with little bandwidth between processors or it suffers badly from the higher overheads because of frequent propagation of small block of data versus infrequent large data block propagation, LRC can really help the performance.

Consider a system employs a software level shared memory abstraction rather than an actual hardware implementation. In this system, write propagation is executed at a page granularity, which makes it extremely expensive to propagate a whole page when only one block in this page is modified. Therefore, write propagation is delayed until a release synchronization point is reached and the entire page will be modified at this time and the whole page will be propagated.

Drawback

LRC requires performing write propagation in bulk at the release point of synchronization. Propagating such a large number of writes altogether will slow down the release access and the subsequent acquire access. Hence it can hardly improve the performance of a hardware cache coherence system.

Release consistency vs. other relaxed consistency models

Weak ordering (Weak consistency)

Release consistency requires more from the programmers compared to weak ordering. They must label synchronization accesses as acquires or releases, not just as synchronization accesses. Similar to weak ordering, Release consistency allows the compiler to freely reorder loads and stores except that they cannot migrate upward past an acquire synchronization and cannot migrate downward past a release synchronization. However, the flexibility and performance advantage of Release consistency comes at the expense of requiring synchronization accesses to be properly identified and identified as acquires and releases. Unlike in weak ordering, synchronization accesses cannot be easily identified by instruction opcodes alone. Hence, the burden is on programmers’ shoulders to properly identify acquire and release synchronization accesses.^[3]^[6]

Processor consistency

For processor consistency, all processes see writes from each processor in the order they were initiated. Writes from different processors may not be seen in the same order, except that writes to the same location will be seen in the same order everywhere. Compared to processor consistency, release consistency is more relaxed because it does not enforce the ordering between stores that happens in processor consistency. It doesn't follow programmers' intuition as it is relatively less restrictive to compiler optimizations.

Related Research Articles

In computer science, mutual exclusion is a property of concurrency control, which is instituted for the purpose of preventing race conditions. It is the requirement that one thread of execution never enters a critical section while a concurrent thread of execution is already accessing said critical section, which refers to an interval of time during which a thread of execution accesses a shared resource or shared memory.

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems. In Modern Operating Systems, Tanenbaum shows that many distinct models of process organization are possible. In many cases, a thread is a component of a process. The multiple threads of a given process may be executed concurrently, sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

In software engineering, double-checked locking is a software design pattern used to reduce the overhead of acquiring a lock by testing the locking criterion before acquiring the lock. Locking occurs only if the locking criterion check indicates that locking is required.

In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

In information technology and computer science, especially in the fields of computer programming, operating systems, multiprocessors, and databases, concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible.

In computer science, a lock or mutex is a synchronization primitive: a mechanism that enforces limits on access to a resource when there are many threads of execution. A lock is designed to enforce a mutual exclusion concurrency control policy, and with a variety of possible methods there exists multiple unique implementations for different applications.

In computer science, a consistency model specifies a contract between the programmer and a system, wherein the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of reading, writing, or updating memory will be predictable. Consistency models are used in distributed systems like distributed shared memory systems or distributed data stores. Consistency is different from coherence, which occurs in systems that are cached or cache-less, and is consistency of data with respect to all processors. Coherence deals with maintaining a global order in which writes to a single location or single variable are seen by all processors. Consistency deals with the ordering of operations to multiple locations with respect to all processors.

In computer science, distributed shared memory (DSM) is a form of memory architecture where physically separated memories can be addressed as a single shared address space. The term "shared" does not mean that there is a single centralized memory, but that the address space is shared—i.e., the same physical address on two processors refers to the same location in memory. Distributed global address space (DGAS), is a similar term for a wide class of software and hardware implementations, in which each node of a cluster has access to shared memory in addition to each node's private memory.

A race condition or race hazard is the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread; for some operations, these algorithms provide a useful alternative to traditional blocking implementations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. "Non-blocking" was used as a synonym for "lock-free" in the literature until the introduction of obstruction-freedom in 2003.

In concurrent programming, concurrent accesses to shared resources can lead to unexpected or erroneous behavior, so parts of the program where the shared resource is accessed need to be protected in ways that avoid the concurrent access. One way to do so is known as a critical section or critical region. This protected section cannot be entered by more than one process or thread at a time; others are suspended until the first leaves the critical section. Typically, the critical section accesses a shared resource, such as a data structure, a peripheral device, or a network connection, that would not operate correctly in the context of multiple concurrent accesses.

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.

In concurrent programming, an operation is linearizable if it consists of an ordered list of invocation and response events, that may be extended by adding response events such that:

The extended list can be re-expressed as a sequential history.
That sequential history is a subset of the original unextended list.

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

In computer science and engineering, transactional memory attempts to simplify concurrent programming by allowing a group of load and store instructions to execute in an atomic way. It is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. Transactional memory systems provide high-level abstraction as an alternative to low-level thread synchronization. This abstraction allows for coordination between concurrent reads and writes of shared data in parallel systems.

In computer science, synchronization refers to one of two distinct but related concepts: synchronization of processes, and synchronization of data. Process synchronization refers to the idea that multiple processes are to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of action. Data synchronization refers to the idea of keeping multiple copies of a dataset in coherence with one another, or to maintain data integrity. Process synchronization primitives are commonly used to implement data synchronization.

The Java programming language and the Java virtual machine (JVM) is designed to support concurrent programming. All execution takes place in the context of threads. Objects and resources can be accessed by many separate threads. Each thread has its own path of execution, but can potentially access any object in the program. The programmer must ensure read and write access to objects is properly coordinated between threads. Thread synchronization ensures that objects are modified by only one thread at a time and prevents threads from accessing partially updated objects during modification by another thread. The Java language has built-in constructs to support this coordination.

In computer science, a concurrent data structure is a particular way of storing and organizing data for access by multiple computing threads on a computer.

Processor Consistency is one of the consistency models used in the domain of concurrent computing.

References

↑ Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors by Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy published in ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
↑ Tanenbaum, Andrew (1995). Distributed Operating Systems. Pearson Education. pp. 327–330. ISBN 9788177581799.
1 2 Solihin, Yan (2015). Fundamentals of Parallel Multicore Architecture. Chapman & Hall/CRC Computational Science. pp. 315–320. ISBN 9781482211184.
↑ Lazy release consistency for software distributed shared memory by Pete Keleher, Alan L. Cox, and Willy Zwaenepoel published in Proceeding ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
↑ TreadMarks: distributed shared memory on standard workstations and operating systems by Pete Keleher, Alan L. Cox, Sandhya Dwarkadas and Willy Zwaenepoel published in WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
↑ Culler, David; Gupta, Anoop; Singh, Jaswinder (1997). Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc. pp. 620–626. ISBN 1558603433.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors by Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy published in ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture

[2] Tanenbaum, Andrew (1995). Distributed Operating Systems. Pearson Education. pp. 327–330. ISBN 9788177581799.

[:0-3] 1 2 Solihin, Yan (2015). Fundamentals of Parallel Multicore Architecture. Chapman & Hall/CRC Computational Science. pp. 315–320. ISBN 9781482211184.

[4] Lazy release consistency for software distributed shared memory by Pete Keleher, Alan L. Cox, and Willy Zwaenepoel published in Proceeding ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture

[5] TreadMarks: distributed shared memory on standard workstations and operating systems by Pete Keleher, Alan L. Cox, Sandhya Dwarkadas and Willy Zwaenepoel published in WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference

[6] Culler, David; Gupta, Anoop; Singh, Jaswinder (1997). Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc. pp. 620–626. ISBN 1558603433.

[1]

[2]

[3]

[4]

[5]

[6]