False sharing

In computer science, false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches, at the granularity of the smallest resource block managed by the caching mechanism. When a system participant periodically accesses data that is not being altered by another party, but that data shares a cache block with data that is being altered, the caching protocol may force the first participant to reload the whole cache block despite a lack of logical necessity. [1] The caching system is unaware of the activity within the block and forces the first participant to bear the caching-system overhead required by truly shared access to a resource.

Multiprocessor CPU caches

By far the most common usage of this term is in modern multiprocessor CPU caches, where memory is cached in lines of some small power-of-two size (e.g., 64 aligned, contiguous bytes). If two processors operate on independent data that falls within the same memory region storable in a single line, the cache coherency mechanisms in the system may force the whole line across the bus or interconnect with every data write, causing memory stalls in addition to wasting system bandwidth. In some cases, eliminating false sharing can result in order-of-magnitude performance improvements. [2] False sharing is an inherent artifact of automatically synchronized cache protocols and can also exist in environments such as distributed file systems or databases, but its current prevalence is limited to RAM caches.
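
For instance (a minimal illustration, with hypothetical field names not taken from the cited sources), two counters that are only ever written by different threads will normally land in the same 64-byte line simply because they are declared next to each other:

#include <cstdint>

// Both counters almost certainly occupy the same 64-byte cache line, so a
// core writing `requests` and another core writing `errors` repeatedly
// invalidate each other's copy of the line, even though neither thread ever
// reads the other's data.
struct Stats
{
    std::uint64_t requests; // written only by thread 1
    std::uint64_t errors;   // written only by thread 2
};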

Example

#include <iostream>
#include <thread>
#include <new>
#include <atomic>
#include <chrono>
#include <latch>
#include <vector>

using namespace std;
using namespace chrono;

#if defined(__cpp_lib_hardware_interference_size)
// default cacheline size from runtime
constexpr size_t CL_SIZE = hardware_constructive_interference_size;
#else
// most common cacheline size otherwise
constexpr size_t CL_SIZE = 64;
#endif

int main()
{
    vector<jthread> threads;
    int hc = jthread::hardware_concurrency();
    hc = hc <= CL_SIZE ? hc : CL_SIZE;
    for( int nThreads = 1; nThreads <= hc; ++nThreads )
    {
        // synchronize beginning of threads coarse on kernel level
        latch coarseSync( nThreads );
        // fine synch via atomic in userspace
        atomic_uint fineSync( nThreads );
        // as much chars as would fit into a cacheline
        struct alignas(CL_SIZE) { char shareds[CL_SIZE]; } cacheLine;
        // sum of all threads execution times
        atomic_int64_t nsSum( 0 );
        for( int t = 0; t != nThreads; ++t )
            threads.emplace_back(
                [&]( char volatile &c )
                {
                    // synch beginning of thread execution on kernel-level
                    coarseSync.arrive_and_wait();
                    // fine-synch on user-level
                    if( fineSync.fetch_sub( 1, memory_order::relaxed ) != 1 )
                        while( fineSync.load( memory_order::relaxed ) );
                    auto start = high_resolution_clock::now();
                    for( size_t r = 10'000'000; r--; )
                        c = c + 1;
                    nsSum += duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
                },
                ref( cacheLine.shareds[t] ) );
        // join all threads
        threads.resize( 0 );
        cout << nThreads << ": " << (int)(nsSum / (1.0e7 * nThreads) + 0.5) << endl;
    }
}

This code shows the effect of false sharing. It creates an increasing number of threads, from one up to the number of hardware threads in the system. Each thread sequentially increments one byte of a cache line that is shared as a whole among all threads. The higher the level of contention between threads, the longer each increment takes. These are the results on a Zen 4 system with 16 cores and 32 threads:

1: 1
2: 4
3: 6
4: 9
5: 11
6: 13
7: 15
8: 17
9: 16
10: 18
11: 21
12: 25
13: 29
14: 35
15: 39
16: 41
17: 43
18: 44
19: 48
20: 49
21: 51
22: 53
23: 58
24: 61
25: 68
26: 75
27: 79
28: 82
29: 85
30: 88
31: 91
32: 94

As can be seen, on the system in question it can take up to about 100 nanoseconds to complete an increment operation on the shared cache line, which at this CPU's clock rate of roughly 4.2 GHz corresponds to approximately 420 clock cycles.

Mitigation

There are ways of mitigating the effects of false sharing. For instance, false sharing in CPU caches can be prevented by reordering variables or by adding padding (unused bytes) between variables, so that heavily written variables do not share a cache line. However, some of these program changes increase the size of objects, leading to higher memory use. [2] Compile-time data transformations can also mitigate false sharing. [3] However, some of these transformations may not always be allowed. For instance, the working draft of the C++23 standard mandates that data members be laid out so that later members have higher addresses, which precludes such reordering by the compiler. [4]
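
As a minimal sketch of the padding approach (assuming per-thread counters like those in the example above; the names are illustrative, not from the cited sources), each thread's counter can be forced onto its own cache line with alignas, so concurrent increments no longer touch the same line:

#include <new>      // hardware_destructive_interference_size, if available
#include <cstddef>

#if defined(__cpp_lib_hardware_interference_size)
// minimum offset that avoids destructive interference between two objects
constexpr std::size_t CL_PAD = std::hardware_destructive_interference_size;
#else
constexpr std::size_t CL_PAD = 64; // common cache-line size as a fallback
#endif

// One counter per thread, each aligned (and therefore padded out) to a full
// cache line; writes from different threads no longer contend for one line.
struct alignas(CL_PAD) PaddedCounter
{
    char value;
};

PaddedCounter counters[32]; // e.g., one per hardware thread

static_assert( sizeof(PaddedCounter) == CL_PAD,
               "each counter occupies exactly one cache line" );

Rerunning the benchmark with each thread incrementing counters[t].value instead of adjacent bytes of a single line should remove the line contention, at the cost of a full cache line of storage per counter instead of one byte, which is the memory overhead mentioned above.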

There are tools for detecting false sharing, such as perf c2c on Linux. [5] [6] There are also systems that both detect and repair false sharing in executing programs; however, these systems incur some execution overhead. [7] [8]

References

  1. Patterson, David (2012). Computer Organization and Design: The Hardware/Software Interface. Waltham, MA: Morgan Kaufmann. p. 537. ISBN 978-0-12-374750-1. OCLC 746618653.
  2. Bolosky, William J.; Scott, Michael L. (1993-09-22). "False sharing and its effect on shared memory performance". SEDMS '93: USENIX Symposium on Experiences with Distributed and Multiprocessor Systems. 4. Retrieved 11 July 2021.
  3. Jeremiassen, Tor E.; Eggers, Susan J. (1995). "Reducing false sharing on shared memory multiprocessors through compile time data transformations". ACM SIGPLAN Notices. Association for Computing Machinery (ACM). 30 (8): 179–188. doi:10.1145/209937.209955. ISSN 0362-1340.
  4. "Working Draft, Standard for Programming Language C++ [class]". eel.is. Retrieved 2021-07-11.
  5. "perf-c2c(1)". Linux manual page. 2016-09-01. Retrieved 2021-08-08.
  6. Chabbi, Milind; Wen, Shasha; Liu, Xu (2018-02-10). "Featherlight on-the-fly false-sharing detection". Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, NY, USA: ACM. pp. 152–167. doi:10.1145/3178487.3178499. ISBN 9781450349826.
  7. Nanavati, Mihir; Spear, Mark; Taylor, Nathan; Rajagopalan, Shriram; Meyer, Dutch T.; Aiello, William; Warfield, Andrew (2013). "Whose cache line is it anyway?". Proceedings of the 8th ACM European Conference on Computer Systems. New York, NY, USA: ACM Press. pp. 141–154. doi:10.1145/2465351.2465366. ISBN 9781450319942.
  8. Liu, Tongping; Berger, Emery D. (2011-10-18). "SHERIFF: precise detection and automatic mitigation of false sharing". ACM SIGPLAN Notices. Association for Computing Machinery (ACM). 46 (10): 3–18. doi:10.1145/2076021.2048070. ISSN 0362-1340.