Transactional memory

Last updated

In computer science and engineering, transactional memory attempts to simplify concurrent programming by allowing a group of load and store instructions to execute in an atomic way. It is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. Transactional memory systems provide high-level abstraction as an alternative to low-level thread synchronization. This abstraction allows for coordination between concurrent reads and writes of shared data in parallel systems. [1]

Contents

Motivation

Atomicity between two parallel transactions with a conflict Transactional Memory.png
Atomicity between two parallel transactions with a conflict

In concurrent programming, synchronization is required when parallel threads attempt to access a shared resource. Low-level thread synchronization constructs such as locks are pessimistic and prohibit threads that are outside a critical section from making any changes. The process of applying and releasing locks often functions as additional overhead in workloads with little conflict among threads. Transactional memory provides optimistic concurrency control by allowing threads to run in parallel with minimal interference. [2] The goal of transactional memory systems is to transparently support regions of code marked as transactions by enforcing atomicity, consistency and isolation.

A transaction is a collection of operations that can execute and commit changes as long as a conflict is not present. When a conflict is detected, a transaction will revert to its initial state (prior to any changes) and will rerun until all conflicts are removed. Before a successful commit, the outcome of any operation is purely speculative inside a transaction. In contrast to lock-based synchronization where operations are serialized to prevent data corruption, transactions allow for additional parallelism as long as few operations attempt to modify a shared resource. Since the programmer is not responsible for explicitly identifying locks or the order in which they are acquired, programs that utilize transactional memory cannot produce a deadlock. [2]

With these constructs in place, transactional memory provides a high-level programming abstraction by allowing programmers to enclose their methods within transactional blocks. Correct implementations ensure that data cannot be shared between threads without going through a transaction and produce a serializable outcome. For example, code can be written as:

deftransfer_money(from_account,to_account,amount):"""Transfer money from one account to another."""withtransaction():from_account.balance-=amountto_account.balance+=amount

In the code, the block defined by "transaction" is guaranteed atomicity, consistency and isolation by the underlying transactional memory implementation and is transparent to the programmer. The variables within the transaction are protected from external conflicts, ensuring that either the correct amount is transferred or no action is taken at all. Note that concurrency related bugs are still possible in programs that use a large number of transactions, especially in software implementations where the library provided by the language is unable to enforce correct use. Bugs introduced through transactions can often be difficult to debug since breakpoints cannot be placed within a transaction. [2]

Transactional memory is limited in that it requires a shared-memory abstraction. Although transactional memory programs cannot produce a deadlock, programs may still suffer from a livelock or resource starvation. For example, longer transactions may repeatedly revert in response to multiple smaller transactions, wasting both time and energy. [2]

Hardware vs. software

Hardware transactional memory using read and write bits Hardware Transaction.png
Hardware transactional memory using read and write bits

The abstraction of atomicity in transactional memory requires a hardware mechanism to detect conflicts and undo any changes made to shared data. [3] Hardware transactional memory systems may comprise modifications in processors, cache and bus protocol to support transactions. [4] [5] [6] [7] [8] Speculative values in a transaction must be buffered and remain unseen by other threads until commit time. Large buffers are used to store speculative values while avoiding write propagation through the underlying cache coherence protocol. Traditionally, buffers have been implemented using different structures within the memory hierarchy such as store queues or caches. Buffers further away from the processor, such as the L2 cache, can hold more speculative values (up to a few megabytes). The optimal size of a buffer is still under debate due to the limited use of transactions in commercial programs. [3] In a cache implementation, the cache lines are generally augmented with read and write bits. When the hardware controller receives a request, the controller uses these bits to detect a conflict. If a serializability conflict is detected from a parallel transaction, then the speculative values are discarded. When caches are used, the system may introduce the risk of false conflicts due to the use of cache line granularity. [3] Load-link/store-conditional (LL/SC) offered by many RISC processors can be viewed as the most basic transactional memory support; however, LL/SC usually operates on data that is the size of a native machine word, so only single-word transactions are supported. [4] Although hardware transactional memory provides maximal performance compared to software alternatives, limited use has been seen at this time.

Software transactional memory provides transactional memory semantics in a software runtime library or the programming language, [9] and requires minimal hardware support (typically an atomic compare and swap operation, or equivalent). As the downside, software implementations usually come with a performance penalty, when compared to hardware solutions. Hardware acceleration can reduce some of the overheads associated with software transactional memory.

Owing to the more limited nature of hardware transactional memory (in current implementations), software using it may require fairly extensive tuning to fully benefit from it. For example, the dynamic memory allocator may have a significant influence on performance and likewise structure padding may affect performance (owing to cache alignment and false sharing issues); in the context of a virtual machine, various background threads may cause unexpected transaction aborts. [10]

History

One of the earliest implementations of transactional memory was the gated store buffer used in Transmeta's Crusoe and Efficeon processors. However, this was only used to facilitate speculative optimizations for binary translation, rather than any form of speculative multithreading, or exposing it directly to programmers. Azul Systems also implemented hardware transactional memory to accelerate their Java appliances, but this was similarly hidden from outsiders. [11]

Sun Microsystems implemented hardware transactional memory and a limited form of speculative multithreading in its high-end Rock processor. This implementation proved that it could be used for lock elision and more complex hybrid transactional memory systems, where transactions are handled with a combination of hardware and software. The Rock processor was canceled in 2009, just before the acquisition by Oracle; while the actual products were never released, a number of prototype systems were available to researchers. [11]

In 2009, AMD proposed the Advanced Synchronization Facility (ASF), a set of x86 extensions that provide a very limited form of hardware transactional memory support. The goal was to provide hardware primitives that could be used for higher-level synchronization, such as software transactional memory or lock-free algorithms. However, AMD has not announced whether ASF will be used in products, and if so, in what timeframe. [11]

More recently, IBM announced in 2011 that Blue Gene/Q had hardware support for both transactional memory and speculative multithreading. The transactional memory could be configured in two modes; the first is an unordered and single-version mode, where a write from one transaction causes a conflict with any transactions reading the same memory address. The second mode is for speculative multithreading, providing an ordered, multi-versioned transactional memory. Speculative threads can have different versions of the same memory address, and hardware implementation keeps track of the age for each thread. The younger threads can access data from older threads (but not the other way around), and writes to the same address are based on the thread order. In some cases, dependencies between threads can cause the younger versions to abort. [11]

Intel's Transactional Synchronization Extensions (TSX) is available in some of the Skylake processors. It was earlier implemented in Haswell and Broadwell processors as well, but the implementations turned out both times to be defective and support for TSX was disabled. The TSX specification describes the transactional memory API for use by software developers, but withholds details on technical implementation. [11] ARM architecture has a similar extension. [12]

As of GCC 4.7, an experimental library for transactional memory is available which utilizes a hybrid implementation. The PyPy variant of Python also introduces transactional memory to the language.

Available implementations

See also

Related Research Articles

<span class="mw-page-title-main">Thread (computing)</span> Smallest sequence of programmed instructions that can be managed independently by a scheduler

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems. In Modern Operating Systems, Tanenbaum shows that many distinct models of process organization are possible. In many cases, a thread is a component of a process. The multiple threads of a given process may be executed concurrently, sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

<span class="mw-page-title-main">Symmetric multiprocessing</span> The equal sharing of all resources by multiple identical processors

Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

<span class="mw-page-title-main">Instruction-level parallelism</span> Ability of computer instructions to be executed simultaneously with correct results

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

<span class="mw-page-title-main">OpenMP</span> Open standard for parallelizing

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

<span class="mw-page-title-main">Race condition</span> When a systems behavior depends on timing of uncontrollable events

A race condition or race hazard is the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.

Release consistency is one of the synchronization-based consistency models used in concurrent programming.

In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response, or by returning the value read from the memory location.

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.

In computer science, a parallel random-access machine is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM). In the same way that the RAM is used by sequential-algorithm designers to model algorithmic performance, the PRAM is used by parallel-algorithm designers to model parallel algorithmic performance. Similar to the way in which the RAM model neglects practical issues, such as access time to cache memory versus main memory, the PRAM model neglects such issues as synchronization and communication, but provides any (problem-size-dependent) number of processors. Algorithm cost, for instance, is estimated using two parameters O(time) and O(time × processor_number).

<span class="mw-page-title-main">Microarchitecture</span> Component of computer engineering

In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

In computer science, software transactional memory (STM) is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. It is an alternative to lock-based synchronization. STM is a strategy implemented in software, rather than as a hardware component. A transaction in this context occurs when a piece of code executes a series of reads and writes to shared memory. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions. The idea of providing hardware support for transactions originated in a 1986 paper by Tom Knight. The idea was popularized by Maurice Herlihy and J. Eliot B. Moss. In 1995 Nir Shavit and Dan Touitou extended this idea to software-only transactional memory (STM). Since 2005, STM has been the focus of intense research and support for practical implementations is growing.

Double compare-and-swap is an atomic primitive proposed to support certain concurrent programming techniques. DCAS takes two not necessarily contiguous memory locations and writes new values into them only if they match pre-supplied "expected" values; as such, it is an extension of the much more popular compare-and-swap (CAS) operation.

<span class="mw-page-title-main">Rock (processor)</span>

Rock was a multithreading, multicore, SPARC microprocessor under development at Sun Microsystems. Canceled in 2010, it was a separate project from the SPARC T-Series (CoolThreads/Niagara) family of processors.

Thread Level Speculation (TLS), also known as Speculative Multithreading, or Speculative Parallelization, is a technique to speculatively execute a section of computer code that is anticipated to be executed later in parallel with the normal execution on a separate independent thread. Such a speculative thread may need to make assumptions about the values of input variables. If these prove to be invalid, then the portions of the speculative thread that rely on these input variables will need to be discarded and squashed. If the assumptions are correct the program can complete in a shorter time provided the thread was able to be scheduled efficiently.

<span class="mw-page-title-main">Multithreading (computer architecture)</span> Ability of a CPU to provide multiple threads of execution concurrently

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

Memory-level parallelism (MLP) is a term in computer architecture referring to the ability to have pending multiple memory operations, in particular cache misses or translation lookaside buffer (TLB) misses, at the same time.

In computer science, a concurrent data structure is a particular way of storing and organizing data for access by multiple computing threads on a computer.

Transactional Synchronization Extensions (TSX), also called Transactional Synchronization Extensions New Instructions (TSX-NI), is an extension to the x86 instruction set architecture (ISA) that adds hardware transactional memory support, speeding up execution of multi-threaded software through lock elision. According to different benchmarks, TSX/TSX-NI can provide around 40% faster applications execution in specific workloads, and 4–5 times more database transactions per second (TPS).

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed. Most modern computer processors have fast and local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, accessing cache memories is typically much faster than accessing main memory, so prefetching data and then accessing it from caches is usually many orders of magnitude faster than accessing it directly from main memory. Prefetching can be done with non-blocking cache control instructions.

References

  1. Harris, Tim; Larus, James; Rajwar, Ravi (2010-06-02). "Transactional Memory, 2nd edition". Synthesis Lectures on Computer Architecture. 5 (1): 1–263. doi:10.2200/S00272ED1V01Y201006CAC011. ISSN   1935-3235.
  2. 1 2 3 4 "Transactional Memory: History and Development". Kukuruku Hub. Retrieved 2016-11-16.
  3. 1 2 3 Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. Berkeley, California: Chapman & Hall. pp. 287–292. ISBN   978-1-4822-1118-4.
  4. 1 2 Herlihy, Maurice; Moss, J. Eliot B. (1993). "Transactional memory: Architectural support for lock-free data structures" (PDF). Proceedings of the 20th International Symposium on Computer Architecture (ISCA). pp. 289–300.
  5. Stone, J.M.; Stone, H.S.; Heidelberger, P.; Turek, J. (1993). "Multiple Reservations and the Oklahoma Update". IEEE Parallel & Distributed Technology: Systems & Applications. 1 (4): 58–71. doi:10.1109/88.260295. S2CID   11017196.
  6. Hammond, L; Wong, V.; Chen, M.; Carlstrom, B.D.; Davis, J.D.; Hertzberg, B.; Prabhu, M.K.; Honggo Wijaya; Kozyrakis, C.; Olukotun, K. (2004). "Transactional memory coherence and consistency". Proceedings of the 31st annual International Symposium on Computer Architecture (ISCA). pp. 102–13. doi:10.1109/ISCA.2004.1310767.
  7. Ananian, C.S.; Asanovic, K.; Kuszmaul, B.C.; Leiserson, C.E.; Lie, S. (2005). "Unbounded transactional memory". 11th International Symposium on High-Performance Computer Architecture. pp. 316–327. doi:10.1109/HPCA.2005.41. ISBN   0-7695-2275-0.
  8. "LogTM: Log-based transactional memory" (PDF). WISC.
  9. "The ATOMOΣ Transactional Programming Language" (PDF). Stanford. Archived from the original (PDF) on 2008-05-21. Retrieved 2009-06-15.
  10. Odaira, R.; Castanos, J. G.; Nakaike, T. (2013). "Do C and Java programs scale differently on Hardware Transactional Memory?". 2013 IEEE International Symposium on Workload Characterization (IISWC). p. 34. doi:10.1109/IISWC.2013.6704668. ISBN   978-1-4799-0555-3.
  11. 1 2 3 4 5 David Kanter (2012-08-21). "Analysis of Haswell's Transactional Memory". Real World Technologies. Retrieved 2013-11-19.
  12. "Arm releases SVE2 and TME for A-profile architecture - Processors blog - Processors - Arm Community". community.arm.com. Retrieved 2019-05-25.
  13. "Transactional Memory Extension (TME) intrinsics" . Retrieved 2020-05-05.
  14. "IBM plants transactional memory in CPU". EE Times.
  15. Brian Hall; Ryan Arnold; Peter Bergner; Wainer dos Santos Moschetta; Robert Enenkel; Pat Haugen; Michael R. Meissner; Alex Mericas; Philipp Oehler; Berni Schiefer; Brian F. Veale; Suresh Warrier; Daniel Zabawa; Adhemerval Zanella (2014). Performance Optimization and Tuning Techniques for IBM Processors, including IBM POWER8 (PDF). IBM Redbooks. pp. 37–40. ISBN   978-0-7384-3972-3.
  16. Wei Li, IBM XL compiler hardware transactional memory built-in functions for IBM AIX on IBM POWER8 processor-based systems
  17. "Power ISA Version 3.1". openpowerfoundation.org. 2020-05-01. Retrieved 2020-10-10.
  18. Java on a 1000 Cores – Tales of Hardware/Software CoDesign on YouTube
  19. "Control.Monad.STM". hackage.haskell.org. Retrieved 2020-02-06.
  20. "STMX Homepage".
  21. Wong, Michael. "Transactional Language Constructs for C++" (PDF). Retrieved 12 Jan 2011.
  22. "Brief Transactional Memory GCC tutorial".
  23. "C Dialect Options - Using the GNU Compiler Collection (GCC)".
  24. "TransactionalMemory - GCC Wiki".
  25. Rigo, Armin. "Using All These Cores: Transactional Memory in PyPy". europython.eu. Retrieved 7 April 2015.
  26. "picotm - Portable Integrated Customizable and Open Transaction Manager".
  27. "Concurrent::TVar".

Further reading