Lockstep (computing)

Last updated

Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel. [1] The redundancy (duplication) allows error detection and error correction: the output from lockstep operations can be compared to determine if there has been a fault if there are at least two systems (dual modular redundancy), and the error can be automatically corrected if there are at least three systems (triple modular redundancy), via majority vote. The term "lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical.

Contents

To run in lockstep, each system is set up to progress from one well-defined state to the next well-defined state. When a new set of inputs reaches the system, it processes them, generates new outputs and updates its state. This set of changes (new inputs, new outputs, new state) is considered to define that step, and must be treated as an atomic transaction; in other words, either all of it happens, or none of it happens, but not something in between. Sometimes a timeshift (delay) is set between systems, which increases the detection probability of errors induced by external influences (e.g. voltage spikes, ionizing radiation, or in situ reverse engineering).

Lockstep memory

Some vendors, including Intel, use the term lockstep memory to describe a multi-channel memory layout in which cache lines are distributed between two memory channels, so one half of the cache line is stored in a DIMM on the first channel, while the second half goes to a DIMM on the second channel. By combining the single error correction and double error detection (SECDED) capabilities of two ECC-enabled DIMMs in a lockstep layout, their single-device data correction (SDDC) nature can be extended into double-device data correction (DDDC), providing protection against the failure of any single memory chip. [2] [3] [4] [5]

Downsides of the Intel's lockstep memory layout are the reduction of effectively usable amount of RAM (in case of a triple-channel memory layout, maximum amount of memory reduces to one third of the physically available maximum), and reduced performance of the memory subsystem. [2] [4]

Dual modular redundancy

Where the computing systems are duplicated, but both actively process each step, it is difficult to arbitrate between them if their outputs differ at the end of a step. For this reason, it is common practice to run DMR systems as "master/slave" configurations with the slave as a "hot-standby" to the master, rather than in lockstep. Since there is no advantage in having the slave unit actively process each step, a common method of working is for the master to copy its state at the end of each step's processing to the slave. Should the master fail at some point, the slave is ready to continue from the previous known good step.

While either the lockstep or the DMR approach (when combined with some means of detecting errors in the master) can provide redundancy against hardware failure in the master, they do not protect against software failure. If the master fails because of a software error, it is highly likely that the slave - in attempting to repeat the execution of the step which failed - will simply repeat the same error and fail in the same way, an example of a common mode failure.

Triple modular redundancy

Where the computing systems are triplicated, it becomes possible to treat them as "voting" systems. If one unit's output disagrees with the other two, it is detected as having failed. The matched output from the other two is treated as correct.

See also

Related Research Articles

Error detection and correction Techniques that enable reliable delivery of digital data over unreliable communication channels

In information theory and coding theory with applications in computer science and telecommunication, error detection and correction or error control are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise, and thus errors may be introduced during transmission from the source to a receiver. Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data in many cases.

Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's acquisition of Compaq and the split of Hewlett Packard into HP Inc. and Hewlett Packard Enterprise.

Xeon

Xeon is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same architecture as regular desktop-grade CPUs, but have advanced features such as support for ECC memory, higher core counts, support for larger amounts of RAM, larger cache memory and extra provision for enterprise-grade reliability, availability and serviceability (RAS) features responsible for handling hardware exceptions through the Machine Check Architecture. They are often capable of safely continuing execution where a normal processor cannot due to these extra RAS features, depending on the type and severity of the machine-check exception (MCE). Some also support multi-socket systems with two, four, or eight sockets through use of the Quick Path Interconnect (QPI) bus.

Altix

Altix is a line of server computers and supercomputers produced by Silicon Graphics, based on Intel processors. It succeeded the MIPS/IRIX-based Origin 3000 servers.

Redundancy (engineering) Duplication of critical components to increase reliability of a system

In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

In the fields of digital electronics and computer hardware, multi-channel memory architecture is a technology that increases the data transfer rate between the DRAM memory and the memory controller by adding more channels of communication between them. Theoretically this multiplies the data rate by exactly the number of channels present. Dual-channel memory employs two channels. The technique goes back as far as the 1960s having been used in IBM System/360 Model 91 and in CDC 6600.

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

Fully Buffered DIMM memory technology

Fully Buffered DIMM is a memory technology that can be used to increase reliability and density of memory systems. Unlike the parallel bus architecture of traditional DRAMs, an FB-DIMM has a serial interface between the memory controller and the advanced memory buffer (AMB). Conventionally, data lines from the memory controller have to be connected to data lines in every DRAM module, i.e. via multidrop buses. As the memory width increases together with the access speed, the signal degrades at the interface between the bus and the device. This limits the speed and memory density, so FB-DIMMs take a different approach to solve the problem.

ECC memory Self-correcting computer data storage

Error-correcting code memory is a type of computer data storage that can detect and correct the most-common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing.

IBM BladeCenter IBMs blade server architecture

The IBM BladeCenter was IBM's blade server architecture, until it was replaced by Flex System. The x86 division was later sold to Lenovo in 2014.

Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code, that can correct multiple bits with less overhead.

Mac Pro Line of workstation and server computers for professionals

The Mac Pro is a series of workstations and servers for professionals designed, manufactured, and sold by Apple Inc. since 2006. The Mac Pro, in most configurations and in terms of speed and performance, is the most powerful computer that Apple offers. It is one of four desktop computers in the current Macintosh lineup, sitting above the consumer Mac Mini and iMac, and alongside the all-in-one iMac Pro.

The memory controller is a digital circuit that manages the flow of data going to and from the computer's main memory. A memory controller can be a separate chip or integrated into another chip, such as being placed on the same die or as an integral part of a microprocessor; in the latter case, it is usually called an integrated memory controller (IMC). A memory controller is sometimes also called a memory chip controller (MCC) or a memory controller unit (MCU).

In reliability engineering, dual modular redundancy (DMR) is when components of a system are duplicated, providing redundancy in case one should fail. It is particularly applied to systems where the duplicated components work in parallel, particularly in fault-tolerant computer systems. A typical example is a complex computer system which has duplicated nodes, so that should one node fail, another is ready to carry on its work.

Predictive Failure Analysis (PFA) refers to methods intended to predict imminent failure of systems or components, and potentially enable mechanisms to avoid or counteract failure issues, or recommend maintenance of systems prior to failure.

Triple modular redundancy

In computing, triple modular redundancy, sometimes called triple-mode redundancy, (TMR) is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a majority-voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.

Bloomfield (microprocessor)

Bloomfield is the code name for Intel high-end desktop processors sold as Core i7-9xx and single-processor servers sold as Xeon 35xx., in almost identical configurations, replacing the earlier Yorkfield processors. The Bloomfield core is closely related to the dual-processor Gainestown, which has the same CPUID value of 0106Ax and which uses the same socket. Bloomfield uses a different socket than the later Lynnfield and Clarksfield processors based on the same 45 nm Nehalem microarchitecture, even though some of these share the same Intel Core i7 brand.

Skylake (microarchitecture)

Skylake is the codename used by Intel for a processor microarchitecture that was launched in August 2015 succeeding the Broadwell microarchitecture. Skylake is a microarchitecture redesign using the same 14 nm manufacturing process technology as its predecessor, serving as a "tock" in Intel's "tick–tock" manufacturing and design model. According to Intel, the redesign brings greater CPU and GPU performance and reduced power consumption. Skylake CPUs share their microarchitecture with Kaby Lake, Coffee Lake, Cannon Lake, Whiskey Lake, and Comet Lake CPUs.

Cascade Lake is an Intel codename for a 14 nanometer server, workstation and enthusiast processor microarchitecture, launched in April 2019. In Intel's Process-Architecture-Optimization model, Cascade Lake is an optimization of Skylake. Intel states that this will be their first microarchitecture to support 3D XPoint-based memory modules. It also features Deep Learning Boost instructions and mitigations for Meltdown and Spectre. Intel officially launched new Xeon Scalable SKUs on February 24, 2020.

References

  1. Stefan Poledna (1996). Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism. books.google.com. p. 80. ISBN   9780585295800 . Retrieved 2014-09-08.
  2. 1 2 Sree Syamalakumari (2014-02-18). "Intel Xeon Processor E7 V2 Family Technical Overview, Section 3.1: Intel C104/102 Scalable Memory Buffer". Intel . Retrieved 2014-09-09.
  3. Thomas Willhalm (2014-07-11). "Independent Channel vs. Lockstep Mode – Drive your Memory Faster or Safer". Intel . Retrieved 2014-09-09.
  4. 1 2 "Best Practice Guidelines for ProLiant Servers with the Intel Xeon 5500 processor series Engineering Whitepaper, 1st Edition" (PDF). HP. May 2009. pp. 8–9. Retrieved 2014-09-09.
  5. "Intel C102/C104 Scalable Memory Buffer Datasheet, Section 1.3.1.2.2: 1:1 Sub-channel Lockstep Mode" (PDF). Intel. February 2014. p. 9. Retrieved 2015-01-25.