Error correction code memory (ECC memory) is a type of computer data storage that uses an error correction code [lower-alpha 1] (ECC) to detect and correct n-bit data corruption which occurs in memory.
Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.
ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches.
Error correction codes protect against undetected data corruption and are used in computers where such corruption is unacceptable, examples being scientific and financial computing applications, or in database and file servers. ECC can also reduce the number of crashes in multi-user server applications and maximum-availability systems.
Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them. [2] Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes). [3] As a result, systems operating at high altitudes require special provisions for reliability.
As an example, the spacecraft Cassini–Huygens , launched in 1997, contained two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Due to built-in EDAC functionality, the spacecraft's engineering telemetry reported the number of (correctable) single-bit-per-word errors and (uncorrectable) double-bit-per-word errors. During the first 2.5 years of flight, the spacecraft reported a nearly constant single-bit error rate of about 280 errors per day. However, on November 6, 1997, during the first month in space, the number of errors increased by more than a factor of four on that single day. This was attributed to a solar particle event that had been detected by the satellite GOES 9. [4]
There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell's state. [3] On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies [5] show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are unfounded.
Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from 10−10 error/bit·h (roughly one bit error per hour per gigabyte of memory) to 10−17 error/bit·h (roughly one bit error per millennium per gigabyte of memory). [5] [6] [7] A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance '09 conference. [6] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10−11 error/bit·h) and 70,000 (7.0 × 10−11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.
The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most-common hardware causes of machine crashes. [6] Memory errors can cause security vulnerabilities. [6] A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved. A 2010 simulation study showed that, for a web browser, only a small fraction of memory errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of memory errors were greater than would be expected for independent soft errors. [8]
Some tests conclude that the isolation of DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Thus, accessing data stored in DRAM causes memory cells to leak their charges and interact electrically, as a result of high cell density in modern memory, altering the content of nearby memory rows that actually were not addressed in the original memory access. This effect is known as row hammer, and it has also been used in some privilege escalation computer security exploits. [9] [10]
An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the character "8" (decimal value 56 in the ASCII encoding) is stored in the byte that contains the stuck bit at its lowest bit position; then, a change is made to the spreadsheet and it is saved. As a result, the "8" (0011 1000 binary) has silently become a "9" (0011 1001).
Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming, RAM parity memory, and ECC memory.
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most-common error correcting code, a single-error correction and double-error detection (SECDED) Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected. Chipkill ECC is a more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.
Seymour Cray famously said "parity is for farmers" when asked why he left this out of the CDC 6600. [11] Later, he included parity in the CDC 7600, which caused pundits to remark that "apparently a lot of farmers buy computers". The original IBM PC and all PCs until the early 1990s used parity checking. [12] Later ones mostly did not.
An ECC-capable memory controller can generally [lower-alpha 1] detect and correct errors of a single bit per word [lower-alpha 2] (the unit of bus transfer), and detect (but not correct) errors of two bits per word. The BIOS in some computers, when matched with operating systems such as some versions of Linux, BSD, and Windows (Windows 2000 and later [13] ), allows counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.
Some DRAM chips include "internal" on-chip error correction circuits, which allow systems with non-ECC memory controllers to still gain most of the benefits of ECC memory. [14] [15] In some systems, a similar effect may be achieved by using EOS memory modules.
Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, it is assumed that the failure of each bit in a word of memory is independent, resulting in improbability of two simultaneous errors. This used to be the case when memory chips were one-bit wide, what was typical in the first half of the 1980s; later developments moved many bits into the same chip. This weakness is addressed by various technologies, including IBM's Chipkill, Sun Microsystems' Extended ECC, Hewlett-Packard's Chipspare, and Intel's Single Device Data Correction (SDDC).
DRAM memory may provide increased protection against soft errors by relying on error correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation. Some systems also "scrub" the memory, by periodically reading all addresses and writing back corrected versions if necessary to remove soft errors.
Interleaving allows for distribution of the effect of a single cosmic ray, potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be maintained. [16]
Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy (TMR). The latter is preferred because its hardware is faster than that of Hamming error correction scheme. [16] Space satellite systems often use TMR, [17] [18] [19] although satellite RAM usually uses Hamming error correction. [20]
Many early implementations of ECC memory mask correctable errors, acting "as if" the error never occurred, and only report uncorrectable errors. Modern implementations log both correctable errors (CE) and uncorrectable errors (UE). Some people proactively replace memory modules that exhibit high error rates, in order to reduce the likelihood of uncorrectable error events. [21]
Many ECC memory systems use an "external" EDAC circuit between the CPU and the memory. A few systems with ECC memory use both internal and external EDAC systems; the external EDAC system should be designed to correct certain errors that the internal EDAC system is unable to correct. [14] Modern desktop and server CPUs integrate the EDAC circuit into the CPU, [22] even before the shift toward CPU-integrated memory controllers, which are related to the NUMA architecture. CPU integration enables a zero-penalty EDAC system during error-free operation.
As of 2009, the most-common error-correction codes use Hamming or Hsiao codes that provide single-bit error correction and double-bit error detection (SEC-DED). Other error-correction codes have been proposed for protecting memory – double-bit error correcting and triple-bit error detecting (DEC-TED) codes, single-nibble error correcting and double-nibble error detecting (SNC-DND) codes, Reed–Solomon error correction codes, etc. However, in practice, multi-bit correction is usually implemented by interleaving multiple SEC-DED codes. [23] [24]
Early research attempted to minimize the area and delay overheads of ECC circuits. Hamming first demonstrated that SEC-DED codes were possible with one particular check matrix. Hsiao showed that an alternative matrix with odd weight columns provides SEC-DED capability with less hardware area and shorter delay than traditional Hamming SEC-DED codes. More recent research also attempts to minimize power in addition to minimizing area and delay. [25] [26] [27]
Many CPUs use error-correction codes in the on-chip cache, including the Intel Itanium, Xeon, Core and Pentium (since P6 microarchitecture) [28] [29] processors, the AMD Athlon, Opteron, all Zen- [30] and Zen+-based [31] processors (EPYC, EPYC Embedded, Ryzen and Ryzen Threadripper), and the DEC Alpha 21264. [23] [32]
As of 2006 [update] , EDC/ECC and ECC/ECC are the two most-common cache error-protection techniques used in commercial microprocessors. The EDC/ECC technique uses an error-detecting code (EDC) in the level 1 cache. If an error is detected, data is recovered from ECC-protected level 2 cache. The ECC/ECC technique uses an ECC-protected level 1 cache and an ECC-protected level 2 cache. [33] CPUs that use the EDC/ECC technique always write-through all STOREs to the level 2 cache, so that when an error is detected during a read from the level 1 data cache, a copy of that data can be recovered from the level 2 cache.
Registered, or buffered, memory is not the same as ECC; the technologies perform different functions. It is usual for memory used in servers to be both registered, to allow many memory modules to be used without electrical problems, and ECC, for data integrity. Memory used in desktop computers is usually neither, for economy. However, unbuffered (not-registered) ECC memory is available, [34] and some non-server motherboards support ECC functionality of such modules when used with a CPU that supports ECC. [35] Registered memory does not work reliably in motherboards without buffering circuitry, and vice versa.
Ultimately, there is a trade-off between protection against unusual loss of data and a higher cost.
ECC memory usually involves a higher price when compared to non-ECC memory, due to additional hardware required for producing ECC memory modules, and due to lower production volumes of ECC memory and associated system hardware. Motherboards, chipsets and processors that support ECC may also be more expensive.
ECC support varies among motherboard manufacturers so ECC memory may simply not be recognized by an ECC-incompatible motherboard. Most motherboards and processors for less critical applications are not designed to support ECC. Some ECC-enabled boards and processors are able to support unbuffered (unregistered) ECC, but will also work with non-ECC memory; system firmware enables ECC functionality if ECC memory is installed.
ECC may lower memory performance by around 2–3 percent on some systems, depending on the application and implementation, due to the additional time needed for ECC memory controllers to perform error checking. [36] However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses as long as no errors are detected. [22] [37] [38] This is not the case for in-band ECC which stores tables used for protection in a reserved region of main system memory, [39] [40] supported by Intel for Chromebooks, which showed little impact on web browsing and productivity tasks, but caused up to a 25% reduction in gaming and video editing benchmarks. [41]
ECC supporting memory may contribute to additional power consumption due to error correcting circuitry.
Double Data Rate Synchronous Dynamic Random-Access Memory is a double data rate (DDR) synchronous dynamic random-access memory (SDRAM) class of memory integrated circuits used in computers. DDR SDRAM, also retroactively called DDR1 SDRAM, has been superseded by DDR2 SDRAM, DDR3 SDRAM, DDR4 SDRAM and DDR5 SDRAM. None of its successors are forward or backward compatible with DDR1 SDRAM, meaning DDR2, DDR3, DDR4 and DDR5 memory modules will not work on DDR1-equipped motherboards, and vice versa.
In information theory and coding theory with applications in computer science and telecommunication, error detection and correction (EDAC) or error control are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise, and thus errors may be introduced during transmission from the source to a receiver. Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data in many cases.
Dynamic random-access memory is a type of random-access semiconductor memory that stores each bit of data in a memory cell, usually consisting of a tiny capacitor and a transistor, both typically based on metal–oxide–semiconductor (MOS) technology. While most DRAM memory cell designs use a capacitor and transistor, some only use two transistors. In the designs where a capacitor is used, the capacitor can either be charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1. The electric charge on the capacitors gradually leaks away; without intervention the data on the capacitor would soon be lost. To prevent this, DRAM requires an external memory refresh circuit which periodically rewrites the data in the capacitors, restoring them to their original charge. This refresh process is the defining characteristic of dynamic random-access memory, in contrast to static random-access memory (SRAM) which does not require data to be refreshed. Unlike flash memory, DRAM is volatile memory, since it loses its data quickly when power is removed. However, DRAM does exhibit limited data remanence.
A DIMM, or Dual In-Line Memory Module, is a popular type of memory module used in computers. It is a printed circuit board with one or both sides holding DRAM chips and pins. The vast majority of DIMMs are standardized through JEDEC standards, although there are proprietary DIMMs. DIMMs come in a variety of speeds and sizes, but generally are one of two lengths - PC which are 133.35 mm (5.25 in) and laptop (SO-DIMM) which are about half the size at 67.60 mm (2.66 in).
A SIMM is a type of memory module used in computers from the early 1980s to the early 2000s. It is a printed circuit board on which has random-access memory attached to one or both sides. It differs from a dual in-line memory module (DIMM), the most predominant form of memory module since the late 1990s, in that the contacts on a SIMM are redundant on both sides of the module. SIMMs were standardised under the JEDEC JESD-21C standard.
A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.
Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer, transmission, and storage systems use a number of measures to provide end-to-end data integrity, or lack of errors.
Memory scrubbing consists of reading from each computer memory location, correcting bit errors with an error-correcting code (ECC), and writing the corrected data back to the same location.
In electronics and computing, a soft error is a type of error where a signal or datum is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but is not assumed to imply such a mistake or breakage. After observing a soft error, there is no implication that the system is any less reliable than before. One cause of soft errors is single event upsets from cosmic rays.
In the fields of digital electronics and computer hardware, multi-channel memory architecture is a technology that increases the data transfer rate between the DRAM memory and the memory controller by adding more channels of communication between them. Theoretically, this multiplies the data rate by exactly the number of channels present. Dual-channel memory employs two channels. The technique goes back as far as the 1960s having been used in IBM System/360 Model 91 and in CDC 6600.
Registered memory is computer memory that has a register between the DRAM modules and the system's memory controller. A registered memory module places less electrical load on a memory controller compared to an unregistered one. Registered memory allows a computer system to remain stable with a higher number of memory modules than it would have otherwise.
Double Data Rate 3 Synchronous Dynamic Random-Access Memory is a type of synchronous dynamic random-access memory (SDRAM) with a high bandwidth interface, and has been in use since 2007. It is the higher-speed successor to DDR and DDR2 and predecessor to DDR4 synchronous dynamic random-access memory (SDRAM) chips. DDR3 SDRAM is neither forward nor backward compatible with any earlier type of random-access memory (RAM) because of different signaling voltages, timings, and other factors.
RAM parity checking is the storing of a redundant parity bit representing the parity of a small amount of computer data stored in random-access memory, and the subsequent comparison of the stored and the computed parity to detect whether a data error has occurred.
A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.
Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code, that can correct multiple bits with less overhead.
A memory controller, also known as memory chip controller (MCC) or a memory controller unit (MCU), is a digital circuit that manages the flow of data going to and from a computer's main memory. When a memory controller is integrated into another chip, such as being placed on the same die or as an integral part of a microprocessor, it is usually called an integrated memory controller (IMC).
Random-access memory is a form of electronic computer memory that can be read and changed in any order, typically used to store working data and machine code. A random-access memory device allows data items to be read or written in almost the same amount of time irrespective of the physical location of data inside the memory, in contrast with other direct-access data storage media, where the time required to read and write data items varies significantly depending on their physical locations on the recording medium, due to mechanical limitations such as media rotation speeds and arm movement.
In the design of modern computers, memory geometry describes the internal structure of random-access memory. Memory geometry is of concern to consumers upgrading their computers, since older memory controllers may not be compatible with later products. Memory geometry terminology can be confusing because of the number of overlapping terms.
Double Data Rate 5 Synchronous Dynamic Random-Access Memory is the latest type of synchronous dynamic random-access memory. Compared to its predecessor DDR4 SDRAM, DDR5 was planned to reduce power consumption, while doubling bandwidth. The standard, originally targeted for 2018, was released on July 14, 2020.
Row hammer is a computer security exploit that takes advantage of an unintended and undesirable side effect in dynamic random-access memory (DRAM) in which memory cells interact electrically between themselves by leaking their charges, possibly changing the contents of nearby memory rows that were not addressed in the original memory access. This circumvention of the isolation between DRAM memory cells results from the high cell density in modern DRAM, and can be triggered by specially crafted memory access patterns that rapidly activate the same memory rows numerous times.
{{cite book}}
: |journal=
ignored (help)!mca
". docs.microsoft.com. Retrieved 2021-03-27.