Software lockout

In multiprocessor computer systems, software lockout is the issue of performance degradation due to the idle wait times spent by the CPUs in kernel-level critical sections. Software lockout is the major cause of scalability degradation in a multiprocessor system, posing a limit on the maximum useful number of processors. To mitigate the phenomenon, the kernel must be designed so that its critical sections are as short as possible, which in practice means decomposing each shared data structure into smaller substructures.

Kernel-level critical sections

In most multiprocessor systems, each processor schedules and controls itself, so there is no "supervisor" processor, [1] and kernel data structures are globally shared; sections of code that access those shared data structures are critical sections. This design choice is made to improve scaling, reliability and modularity. [1] Examples of such kernel data structures are the ready list and communication channels.

A "conflict" happens when more than one processor is trying to access the same resource (a memory portion) at the same time. To prevent critical races and inconsistency, only one processor (CPU) at a given time is allowed to access a particular data structure (a memory portion), while other CPUs trying to access at the same time are locked-out, waiting in idle status. [1] [2]

Three cases can be distinguished, according to whether this idle wait is necessary, merely convenient, or not convenient. The idle wait is necessary when the access is to a ready list for a low-level scheduling operation. It is not strictly necessary but convenient in the case of a critical section for synchronization/IPC operations, since these require less time than a context switch (executing another process to avoid the idle wait). The idle wait is not convenient in the case of a kernel critical section for device management, which is present only in monolithic kernels; a microkernel involves only the first two cases.
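
The trade-off between spinning and context switching can be sketched in C as follows (a hypothetical adaptive acquire; the function name and spin budget are invented for illustration): spinning pays off only while the expected wait is shorter than the cost of a context switch.

    #include <pthread.h>

    /* Hypothetical adaptive acquire: spin briefly, betting that the current
       holder's critical section is shorter than a context switch; if the bet
       fails, block, paying the context switch to free the CPU for other work. */
    void adaptive_lock(pthread_mutex_t *m) {
        for (int spins = 0; spins < 1000; spins++)   /* illustrative spin budget */
            if (pthread_mutex_trylock(m) == 0)
                return;                              /* acquired while spinning */
        pthread_mutex_lock(m);                       /* give up and block */
    }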

In a multiprocessor system, most conflicts are kernel-level conflicts, due to accesses to kernel-level critical sections, so the idle wait periods they generate have a major impact on performance degradation. This idle wait time increases the average number of idle processors and thus decreases scalability and relative efficiency.

Analytical studies

Taking as parameters the average time interval a processor spends in kernel-level critical sections (L, for time in the locked state) and the average time interval it spends on tasks outside critical sections (E), [1] the ratio L/E is crucial in evaluating software lockout.

Typical values for L/E range from 0.01 to 0.1. [3] In a system with an L/E ratio of 0.05, for instance, with 15 CPUs it is expected that on average 1 CPU will always be idle; [3] with 21 CPUs, 2.8 will be idle; [4] with 40 CPUs, 19 will be idle; and with 41 CPUs, 20 will be idle. [3] Adding more than 40 CPUs to such a system would therefore be useless. In general, for each L/E value there is a threshold for the maximum number of useful CPUs.
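
These figures can be reproduced with a single-server "machine repairman" queueing model, used here as an illustrative assumption (Madnick's original derivation may differ in detail): each of the n CPUs alternates between an average interval E of work outside the kernel and a request for one global lock that is held for an average interval L.

    #include <stdio.h>

    /* Expected number of CPUs idling at a single global lock, for n CPUs and
       r = L/E. In the machine-repairman model, the unnormalized steady-state
       probability that k CPUs are at the lock is p_k = n!/(n-k)! * r^k; one
       of those k holds the lock (useful work), the other k-1 wait idle. */
    static double expected_idle(int n, double r) {
        double p = 1.0, norm = 0.0, idle = 0.0;
        for (int k = 0; k <= n; k++) {
            norm += p;
            if (k >= 2)
                idle += (k - 1) * p;
            p *= (n - k) * r;            /* p_{k+1} = p_k * (n-k) * r */
        }
        return idle / norm;
    }

    int main(void) {
        const int cpus[] = { 15, 21, 40, 41 };
        for (int i = 0; i < 4; i++)
            printf("%2d CPUs -> about %4.1f idle\n", cpus[i], expected_idle(cpus[i], 0.05));
        /* Prints roughly 0.9, 2.8, 19 and 20 idle CPUs for L/E = 0.05,
           matching the figures cited above. */
        return 0;
    }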

Software lockout mitigation

To reduce the performance degradation of software lockout to reasonable levels (L/E between 0.05 and 0.1), the kernel and/or the operating system must be designed accordingly. Conceptually, the best solution is to decompose each kernel data structure into smaller independent substructures, each with its own shorter processing time. This allows more than one CPU to access the original data structure concurrently.
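
A hypothetical C sketch of this decomposition (names and bucket count invented for illustration): replacing one lock over a whole table with one lock per bucket lets CPUs that touch different buckets proceed in parallel instead of locking each other out.

    #include <pthread.h>

    #define NBUCKETS 64                  /* illustrative: more substructures, less contention */

    struct bucket {
        pthread_mutex_t lock;            /* one lock per substructure, not per table */
        long value;
    };

    static struct bucket table[NBUCKETS];

    void table_init(void) {
        for (int i = 0; i < NBUCKETS; i++)
            pthread_mutex_init(&table[i].lock, NULL);
    }

    /* Only CPUs hashing to the same bucket contend; updates to different
       buckets no longer serialize behind a single global lock. */
    void table_add(unsigned key, long delta) {
        struct bucket *b = &table[key % NBUCKETS];
        pthread_mutex_lock(&b->lock);
        b->value += delta;
        pthread_mutex_unlock(&b->lock);
    }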

Many uniprocessor systems with hierarchical protection domains have been estimated to spend up to 50% of their time performing "supervisor mode" operations. If such systems were adapted for multiprocessing by placing a lock on any access to the "supervisor state", L/E would easily be greater than 1, [3] resulting in a system with the same throughput as the uniprocessor regardless of the number of CPUs.

Notes

  1. Madnick 1968, p. 19.
  2. Saltzer, Jerome (June 1966). Traffic control in a multiplexed computer system (PDF) (PhD thesis). MIT Project MAC, MAC-TR-30.
  3. Madnick 1968, p. 20.
  4. Raynor 1976, p. 62.


References

Madnick, Stuart E. (1968). "Multi-processor software lockout". Proceedings of the 1968 23rd ACM National Conference. pp. 19–24.
