Lockstep (computing)

Last updated June 27, 2024

Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel.^[1] The redundancy (duplication) allows error detection and error correction. With at least two systems (dual modular redundancy), the outputs of lockstep operations can be compared to determine whether a fault has occurred; with at least three systems (triple modular redundancy), the error can also be corrected automatically by majority vote. The term "lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical.
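
The detection case with two systems can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the replicas are modelled as plain functions, and the names `run_in_lockstep`, `outputs_agree`, `healthy` and `faulty` are hypothetical, not taken from any real lockstep implementation.

```python
from typing import Callable, List, Sequence


def run_in_lockstep(replicas: Sequence[Callable[[int], int]], value: int) -> List[int]:
    """Feed the same input to every replica and collect the outputs."""
    return [replica(value) for replica in replicas]


def outputs_agree(outputs: Sequence[int]) -> bool:
    """A mismatch reveals a fault, but with two replicas not which one is wrong."""
    return all(out == outputs[0] for out in outputs)


def healthy(x: int) -> int:
    return x * 2


def faulty(x: int) -> int:
    # Hypothetical fault: the lowest output bit is flipped.
    return (x * 2) ^ 1


print(outputs_agree(run_in_lockstep([healthy, healthy], 21)))  # True
print(outputs_agree(run_in_lockstep([healthy, faulty], 21)))   # False
```

With only two outputs the comparison can flag the disagreement but cannot say which replica produced the wrong value.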

To run in lockstep, each system is set up to progress from one well-defined state to the next well-defined state. When a new set of inputs reaches the system, it processes them, generates new outputs and updates its state. This set of changes (new inputs, new outputs, new state) is considered to define that step, and must be treated as an atomic transaction; in other words, either all of it happens, or none of it happens, but not something in between. Sometimes a timeshift (delay) is set between systems, which increases the detection probability of errors induced by external influences (e.g. voltage spikes, ionizing radiation, or in situ reverse engineering).
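
One way to picture the atomic step is a unit that first computes the new output and state without side effects and only then commits both. The sketch below assumes a trivial accumulator as the system's function; `State`, `compute` and `LockstepUnit` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class State:
    step: int
    total: int


def compute(state: State, value: int) -> Tuple[int, State]:
    """Pure transition: derive the output and the next state without side effects."""
    if value < 0:
        raise ValueError("hypothetical invalid input")
    new_total = state.total + value
    return new_total, replace(state, step=state.step + 1, total=new_total)


class LockstepUnit:
    def __init__(self) -> None:
        self.state = State(step=0, total=0)
        self.outputs: List[int] = []

    def advance(self, value: int) -> bool:
        """Apply one step atomically: commit output and state together, or nothing."""
        try:
            output, next_state = compute(self.state, value)
        except ValueError:
            return False  # nothing was committed, so the state is unchanged
        self.outputs.append(output)
        self.state = next_state
        return True


unit = LockstepUnit()
print(unit.advance(5))   # True
print(unit.advance(-1))  # False
print(unit.state)        # State(step=1, total=5)
```

Because the failed step leaves neither a new output nor a new state behind, every replica that rejects the same input remains in the same well-defined state.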

Lockstep memory

Some vendors, including Intel, use the term lockstep memory to describe a multi-channel memory layout in which cache lines are distributed between two memory channels, so one half of the cache line is stored in a DIMM on the first channel, while the second half goes to a DIMM on the second channel. By combining the single error correction and double error detection (SECDED) capabilities of two ECC-enabled DIMMs in a lockstep layout, their single-device data correction (SDDC) nature can be extended into double-device data correction (DDDC), providing protection against the failure of any single memory chip.^[2]^[3]^[4]^[5]
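
The data layout alone can be illustrated as follows. The sketch assumes a 64-byte cache line and models only the split into two halves; it does not model ECC, SDDC or DDDC, and the function names are hypothetical.

```python
from typing import Tuple

CACHE_LINE_BYTES = 64  # assumed cache line size, for illustration only


def split_cache_line(line: bytes) -> Tuple[bytes, bytes]:
    """Return the halves destined for the first and second memory channel."""
    if len(line) != CACHE_LINE_BYTES:
        raise ValueError("expected exactly one cache line")
    half = CACHE_LINE_BYTES // 2
    return line[:half], line[half:]


def join_cache_line(first: bytes, second: bytes) -> bytes:
    """Reassemble a cache line from the two channel halves."""
    return first + second


line = bytes(range(CACHE_LINE_BYTES))
channel_0, channel_1 = split_cache_line(line)
print(len(channel_0), len(channel_1))                 # 32 32
print(join_cache_line(channel_0, channel_1) == line)  # True
```

Every access to a cache line therefore involves both channels, which is consistent with the two channels operating as a pair rather than independently.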

The downsides of Intel's lockstep memory layout are a reduction in the effectively usable amount of RAM (in the case of a triple-channel memory layout, the maximum amount of memory reduces to one third of the physically available maximum) and reduced performance of the memory subsystem.^[2]^[4]

Dual modular redundancy

Where the computing systems are duplicated, but both actively process each step, it is difficult to arbitrate between them if their outputs differ at the end of a step. For this reason, it is common practice to run DMR systems as "master/slave" configurations with the slave as a "hot-standby" to the master, rather than in lockstep. Since there is no advantage in having the slave unit actively process each step, a common method of working is for the master to copy its state at the end of each step's processing to the slave. Should the master fail at some point, the slave is ready to continue from the previous known good step.
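
The hot-standby scheme described above can be sketched as follows. The example assumes an in-memory state dictionary and a simple accumulator as the workload; `Unit` and `HotStandbyPair` are hypothetical names, and the slave is called `standby` in the code.

```python
import copy
from typing import Dict


class Unit:
    def __init__(self) -> None:
        self.state: Dict[str, int] = {"step": 0, "total": 0}

    def process(self, value: int) -> int:
        self.state["total"] += value
        self.state["step"] += 1
        return self.state["total"]


class HotStandbyPair:
    def __init__(self) -> None:
        self.master = Unit()
        self.standby = Unit()

    def step(self, value: int) -> int:
        output = self.master.process(value)
        # Checkpoint: copy the master's state to the standby at the end of the step.
        self.standby.state = copy.deepcopy(self.master.state)
        return output

    def fail_over(self) -> None:
        """Promote the standby, which holds the last known good state."""
        self.master, self.standby = self.standby, Unit()


pair = HotStandbyPair()
pair.step(3)
pair.step(4)
pair.fail_over()          # the master is assumed to have failed here
print(pair.master.state)  # {'step': 2, 'total': 7}
```

Only the master does the processing; the standby merely receives a copy of the state, so it can continue from the previous known good step.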

While either the lockstep or the DMR approach (when combined with some means of detecting errors in the master) can provide redundancy against hardware failure in the master, neither protects against software error. If the master fails because of a software error, it is highly likely that the slave, in attempting to repeat the execution of the step which failed, will simply repeat the same error and fail in the same way, an example of a common mode failure.

Triple modular redundancy

Where the computing systems are triplicated, it becomes possible to treat them as "voting" systems. If one unit's output disagrees with the other two, it is detected as having failed. The matched output from the other two is treated as correct.
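
A majority voter of this kind can be sketched as below. This is a minimal illustration that assumes the outputs are directly comparable integers; the function name `vote` and the choice to raise an error when no majority exists are assumptions, not part of any particular TMR design.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def vote(outputs: Sequence[int]) -> Tuple[int, List[int]]:
    """Return the majority output and the indices of the units that disagree with it."""
    winner, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: outputs cannot be arbitrated")
    failed = [i for i, out in enumerate(outputs) if out != winner]
    return winner, failed


print(vote([42, 42, 43]))  # (42, [2])
print(vote([7, 7, 7]))     # (7, [])
```

If all three outputs differ, no majority exists and the voter cannot mask the fault, which is why the sketch raises an error in that case.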

References

[1] Stefan Poledna (1996). Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism. p. 80. ISBN 9780585295800. Retrieved 2014-09-08.
[2] Sree Syamalakumari (2014-02-18). "Intel Xeon Processor E7 V2 Family Technical Overview, Section 3.1: Intel C104/102 Scalable Memory Buffer". Intel. Retrieved 2014-09-09.
[3] Thomas Willhalm (2014-07-11). "Independent Channel vs. Lockstep Mode – Drive your Memory Faster or Safer". Intel. Retrieved 2014-09-09.
[4] "Best Practice Guidelines for ProLiant Servers with the Intel Xeon 5500 processor series Engineering Whitepaper, 1st Edition" (PDF). HP. May 2009. pp. 8–9. Retrieved 2014-09-09.
[5] "Intel C102/C104 Scalable Memory Buffer Datasheet, Section 1.3.1.2.2: 1:1 Sub-channel Lockstep Mode" (PDF). Intel. February 2014. p. 9. Retrieved 2015-01-25.

External links

Enabling Memory Reliability, Availability, and Serviceability Features on Dell PowerEdge Servers, 2005
Chipkill correct memory architecture, August 2000, by David Locklear

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
