Chipkill

Last updated March 09, 2024

Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects memory systems against the failure of any single memory chip, as well as against multi-bit errors from any portion of a single memory chip.^[1]^[2] One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, so that the failure of any single chip affects only one bit of each ECC word. Because a Hamming code can correct a single-bit error in each word, memory contents can be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code, that can correct multiple bits with less overhead.
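The bit-scattering scheme can be illustrated with a small sketch. It assumes a toy Hamming(7,4) code and seven one-bit-wide chips, which is far narrower than real memory words; the function names (`hamming74_encode`, `store`, `kill_chip`, `load`) are hypothetical and chosen only for this illustration, not taken from any IBM implementation.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) as a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Correct up to one flipped bit and return the 4 data bits as an int."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    if syndrome:                        # non-zero syndrome names the bad position
        code[syndrome] ^= 1
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d))


def store(words):
    """Scatter each codeword so that bit i of every word lives on chip i."""
    chips = [[] for _ in range(7)]
    for word in words:
        for chip, bit in zip(chips, hamming74_encode(word)):
            chip.append(bit)
    return chips


def kill_chip(chips, index, stuck_value=0):
    """Model a whole-chip failure: every bit on that chip reads as stuck_value."""
    chips[index] = [stuck_value] * len(chips[index])


def load(chips):
    """Gather one bit per chip for each word and decode it."""
    n_words = len(chips[0])
    return [hamming74_decode([chip[i] for chip in chips]) for i in range(n_words)]
```

Because each chip contributes exactly one bit to every codeword, a dead chip produces at most one wrong bit per word, which is within what the single-error-correcting code can repair.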

Chipkill is frequently combined with dynamic bit-steering, so that if a chip fails (or exceeds a threshold of bit errors), a spare memory chip is used to replace it. The concept is similar to that of RAID, which protects against disk failure, except that it is applied to individual memory chips. The technology was developed by the IBM Corporation in the early to mid-1990s. An important RAS feature, Chipkill technology is deployed primarily on SSDs, mainframes and midrange servers.
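The threshold-driven sparing described above might be modelled roughly as follows. This is a simplified sketch under stated assumptions: one spare chip per rank, a per-chip count of corrected errors, and a hypothetical class `SteeredRank`. It does not reflect any vendor's actual controller logic, and a real controller would also copy the data to the spare before switching over.

```python
class SteeredRank:
    """Toy model of bit-steering: logical lanes are remapped to a spare chip."""

    def __init__(self, n_chips, threshold):
        self.mapping = list(range(n_chips))   # logical lane -> physical chip
        self.spare = n_chips                  # physical index of the one spare
        self.errors = [0] * (n_chips + 1)     # corrected-error count per chip
        self.threshold = threshold
        self.spare_used = False

    def report_corrected_error(self, lane):
        """Record a corrected error; return True if the lane was steered."""
        chip = self.mapping[lane]
        self.errors[chip] += 1
        # Steer only once the count exceeds the threshold and a spare remains.
        if self.errors[chip] > self.threshold and not self.spare_used:
            self.mapping[lane] = self.spare
            self.spare_used = True
            return True
        return False
```

In this model the spare can be consumed only once; after that, further failures must be handled by the error-correcting code alone.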

An equivalent system from Sun Microsystems is called Extended ECC, while equivalent systems from HP are called Advanced ECC^[3] and Chipspare. A similar system from Intel, called Lockstep memory, provides double-device data correction (DDDC) functionality.^[4] Similar systems from Micron, called redundant array of independent NAND (RAIN), and from SandForce, called RAISE level 2, protect data stored on SSDs from any single NAND flash chip going bad.^[5]^[6]

A 2009 paper using data from Google's datacentres^[7] provided evidence that, in the observed Google systems, DRAM errors recurred at the same location and that 8% of DIMMs were affected each year. Specifically, "In more than 85% of the cases a correctable error is followed by at least one more correctable error in the same month". A lower fraction of DIMMs with chipkill error correction reported uncorrectable errors than DIMMs with error-correcting codes that can only correct single-bit errors. A 2010 paper from the University of Rochester, using both real-world memory traces and simulations, also showed that Chipkill memory resulted in substantially fewer memory errors.^[8]

Related Research Articles

Double Data Rate Synchronous Dynamic Random-Access Memory is a double data rate (DDR) synchronous dynamic random-access memory (SDRAM) class of memory integrated circuits used in computers. DDR SDRAM, also retroactively called DDR1 SDRAM, has been superseded by DDR2 SDRAM, DDR3 SDRAM, DDR4 SDRAM and DDR5 SDRAM. None of its successors are forward or backward compatible with DDR1 SDRAM, meaning DDR2, DDR3, DDR4 and DDR5 memory modules will not work on DDR1-equipped motherboards, and vice versa.

Flash memory is an electronic non-volatile computer memory storage medium that can be electrically erased and reprogrammed. The two main types of flash memory, NOR flash and NAND flash, are named for the NOR and NAND logic gates. Both use the same cell design, consisting of floating gate MOSFETs. They differ at the circuit level depending on whether the state of the bit line or word lines is pulled high or low: in NAND flash, the relationship between the bit line and the word lines resembles a NAND gate; in NOR flash, it resembles a NOR gate.

A DIMM, or Dual In-Line Memory Module, is a type of computer memory module used in desktop, laptop, and server computers. It is a circuit board that contains memory chips and connects to the computer's motherboard. A DIMM is often called a "RAM stick" due to its shape and size. A DIMM comprises a series of dynamic random-access memory integrated circuits that are mounted to its circuit board. DIMMs are the predominant method for adding memory into a computer system. The vast majority of DIMMs are standardized through JEDEC standards, although there are proprietary DIMMs. DIMMs come in a variety of speeds and sizes, but generally in one of two lengths: PC modules at 133.35 mm (5.25 in) and laptop (SO-DIMM) modules at about half that length, 67.60 mm (2.66 in).

Data degradation is the gradual corruption of computer data due to an accumulation of non-critical failures in a data storage device. The phenomenon is also known as data decay, data rot or bit rot. This process leads to the slow deterioration of data quality over time, even if the data is not actively being used or accessed.

Memory scrubbing consists of reading from each computer memory location, correcting bit errors with an error-correcting code (ECC), and writing the corrected data back to the same location.

In electronics and computing, a soft error is a type of error where a signal or datum is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but is not assumed to imply such a mistake or breakage. After observing a soft error, there is no implication that the system is any less reliable than before. One cause of soft errors is single event upsets from cosmic rays.

Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel. The redundancy (duplication) allows error detection and error correction: the output from lockstep operations can be compared to determine if there has been a fault if there are at least two systems, and the error can be automatically corrected if there are at least three systems, via majority vote. The term "lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical.
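The difference between detection with two systems and correction with three can be shown in a short sketch. The helper names `fault_detected` and `majority_vote` are hypothetical, and the outputs are assumed to be plain comparable values rather than real hardware signals.

```python
from collections import Counter


def fault_detected(output_a, output_b):
    """With two lockstep systems, a mismatch reveals a fault but not which side is wrong."""
    return output_a != output_b


def majority_vote(outputs):
    """With three or more systems, return the value a strict majority agrees on."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among lockstep outputs")
    return value
```

With three systems, a single faulty output is outvoted by the two that agree; if all three disagree, the sketch raises an error because no correction is possible.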

In the fields of digital electronics and computer hardware, multi-channel memory architecture is a technology that increases the data transfer rate between the DRAM memory and the memory controller by adding more channels of communication between them. Theoretically, this multiplies the data rate by exactly the number of channels present. Dual-channel memory employs two channels. The technique goes back as far as the 1960s having been used in IBM System/360 Model 91 and in CDC 6600.

Registered memory is computer memory that has a register between the DRAM modules and the system's memory controller. A registered memory module places less electrical load on a memory controller compared to an unregistered one. Registered memory allows a computer system to remain stable with a higher number of memory modules than it would have otherwise.

RAM parity checking is the storing of a redundant parity bit representing the parity of a small amount of computer data stored in random-access memory, and the subsequent comparison of the stored and the computed parity to detect whether a data error has occurred.
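As a minimal sketch of this store-and-compare cycle, assuming even parity over one byte (the function names are hypothetical and used only for illustration):

```python
def parity_bit(byte):
    """Even parity: 1 if the byte has an odd number of set bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2


def store_with_parity(byte):
    """Return the (data, parity) pair that would be written to memory."""
    return byte & 0xFF, parity_bit(byte)


def parity_ok(stored):
    """Recompute parity on read and compare it with the stored parity bit."""
    byte, parity = stored
    return parity_bit(byte) == parity
```

A single flipped bit changes the recomputed parity and is detected, but two flipped bits in the same byte cancel out and go unnoticed, and parity alone cannot tell which bit was wrong.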

Error correction code memory is a type of computer data storage that uses an error correction code (ECC) to detect and correct n-bit data corruption which occurs in memory. ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches.

A memory controller is a digital circuit that manages the flow of data going to and from a computer's main memory. A memory controller can be a separate chip or integrated into another chip, such as being placed on the same die or as an integral part of a microprocessor; in the latter case, it is usually called an integrated memory controller (IMC). A memory controller is sometimes also called a memory chip controller (MCC) or a memory controller unit (MCU).

A solid-state drive (SSD) is a solid-state storage device that uses integrated circuit assemblies to store data persistently, typically using flash memory, and functions as secondary storage in the hierarchy of computer storage. It is also sometimes called a semiconductor storage device, a solid-state device, or a solid-state disk, even though SSDs lack the physical spinning disks and movable read-write heads used in hard disk drives (HDDs) and floppy disks. SSDs also have rich internal parallelism for data processing.

Predictive Failure Analysis (PFA) refers to methods intended to predict imminent failure of systems or components, and potentially enable mechanisms to avoid or counteract failure issues, or recommend maintenance of systems prior to failure.

In electronics, a multi-level cell (MLC) is a memory cell capable of storing more than a single bit of information, compared to a single-level cell (SLC), which can store only one bit per memory cell. A memory cell typically consists of a single floating-gate MOSFET, thus multi-level cells reduce the number of MOSFETs required to store the same amount of data as single-level cells.

For computer memory, Memory ProteXion, found in IBM xSeries servers, is a form of "redundant bit steering". This technology uses redundant bits in a data packet to recover from a DIMM failure.

A redundant array of independent memory (RAIM) is a design feature found in certain computers' main random access memory. RAIM utilizes additional memory modules and striping algorithms to protect against the failure of any particular module and keep the memory system operating continuously. RAIM is similar in concept to a redundant array of independent disks (RAID), which protects against the failure of a disk drive, but in the case of memory it supports several DRAM device chipkills and entire memory channel failures. RAIM is much more robust than parity checking and ECC memory technologies which cannot protect against many varieties of memory failures.

An NVDIMM or non-volatile DIMM is a type of persistent random-access memory for computers using widely used DIMM form-factors. Non-volatile memory is memory that retains its contents even when electrical power is removed, for example from an unexpected power loss, system crash, or normal shutdown. Properly used, NVDIMMs can improve application performance and system crash recovery time. They enhance solid-state drive (SSD) endurance and reliability.

Double Data Rate 5 Synchronous Dynamic Random-Access Memory is a type of synchronous dynamic random-access memory. Compared to its predecessor DDR4 SDRAM, DDR5 was planned to reduce power consumption, while doubling bandwidth. The standard, originally targeted for 2018, was released on July 14, 2020.

IBM FlashCore Modules (FCM) are solid state technology computer data storage modules using PCI Express attachment and the NVMe command set. They are offered as an alternative to industry-standard 2.5" NVMe SSDs in selected arrays from the IBM FlashSystem family, with raw storage capacities of 4.8 TB, 9.6 TB, 19.2 TB and 38.4 TB. FlashCore modules support hardware self-encryption and real-time inline hardware data compression up to 115.2 TB address space, without performance impact.

References

[1] Timothy J. Dell (1997-11-19). "A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory" (PDF). IBM. Archived from the original (PDF) on 2015-09-23. Retrieved 2015-02-02.
[2] "Enhancing IBM Netfinity Server Reliability: IBM Chipkill Memory" (PDF). IBM. 2000. Archived from the original (PDF) on 2015-09-23. Retrieved 2015-02-02.
[3] "Best Practice Guidelines for ProLiant Servers with the Intel Xeon 5500 processor series Engineering Whitepaper, 1st Edition" (PDF). HP. May 2009. p. 8. Retrieved 2014-09-09.
[4] Thomas Willhalm (2014-07-11). "Independent Channel vs. Lockstep Mode – Drive your Memory Faster or Safer". Intel. Retrieved 2015-02-02.
[5] Lee Hutchinson. "Solid-state revolution: in-depth on how SSDs really work". 2012.
[6] Eric Slack. "How to Make Reliable SSDs - Reliable NAND Flash".
[7] Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). "DRAM errors in the wild: A large-scale field study" (PDF). Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems. SIGMETRICS '09. ACM. pp. 193–204. doi:10.1145/1555349.1555372. ISBN 9781605585116. S2CID 6115552. Retrieved 7 September 2011.
[8] Li, Xin; Huang, Michael; Shen, Kai; Chu, Lingkun (2010). "A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility" (PDF). Usenix Annual Tech Conference 2010.

External links

Intel E7500 Chipset MCH Intel x4 Single Device Data Correction (x4 SDDC) Implementation and Validation, Intel Application note AP-726, August 2002.
DRAM study turns assumptions about errors upside down, Ars Technica, October 7, 2009
Enabling Memory Reliability, Availability, and Serviceability Features on Dell PowerEdge Servers, 2005
Chipkill correct memory architecture, August 2000, by David Locklear
The Mathematics of Chipkill ECC, October 2015, by Bob Day

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
