Memory scrubbing

Last updated

Memory scrubbing consists of reading from each computer memory location, correcting bit errors (if any) with an error-correcting code (ECC), and writing the corrected data back to the same location. [1]

Contents

Due to the high integration density of modern computer memory chips, the individual memory cell structures became small enough to be vulnerable to cosmic rays and/or alpha particle emission. The errors caused by these phenomena are called soft errors. Over 8% of DIMM modules experience at least one correctable error per year. [2] This can be a problem for DRAM and SRAM based memories. The probability of a soft error at any individual memory bit is very small. However, together with the large amount of memory modern computersespecially servers are equipped with, and together with extended periods of uptime, the probability of soft errors in the total memory installed is significant.[ citation needed ]

The information in an ECC memory is stored redundantly enough to correct single bit error per memory word. Hence, an ECC memory can support the scrubbing of the memory content. Namely, if the memory controller scans systematically through the memory, the single bit errors can be detected, the erroneous bit can be determined using the ECC checksum, and the corrected data can be written back to the memory.

Overview

It is important to check each memory location periodically, frequently enough, before multiple bit errors within the same word are too likely to occur, because the one bit errors can be corrected, but the multiple bit errors are not correctable, in the case of usual (as of 2008) ECC memory modules.

In order to not disturb regular memory requests from the CPU and thus prevent decreasing performance, scrubbing is usually only done during idle periods. As the scrubbing consists of normal read and write operations, it may increase power consumption for the memory compared to non-scrubbing operation. Therefore, scrubbing is not performed continuously but periodically. For many servers, the scrub period can be configured in the BIOS setup program.

The normal memory reads issued by the CPU or DMA devices are checked for ECC errors, but due to data locality reasons they can be confined to a small range of addresses and keeping other memory locations untouched for a very long time. These locations can become vulnerable to more than one soft error, while scrubbing ensures the checking of the whole memory within a guaranteed time.

On some systems, not only the main memory (DRAM-based) is capable of scrubbing but also the CPU caches (SRAM-based). On most systems the scrubbing rates for both can be set independently. Because cache is much smaller than the main memory, the scrubbing for caches does not need to happen as frequently.

Memory scrubbing increases reliability, therefore it can be classified as a RAS feature.

Variants

There are usually two variants, known as patrol scrubbing and demand scrubbing. While they both essentially perform memory scrubbing and associated error correction (if it is doable), the main difference is how these two variants are initiated and executed. Patrol scrubbing runs in an automated manner when the system is idle, while demand scrubbing performs the error correction when the data is actually requested from main memory. [3]

See also

Related Research Articles

<span class="mw-page-title-main">Computer data storage</span> Storage of digital data readable by computers

Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers.

<span class="mw-page-title-main">Computer memory</span> Computer component that stores information for immediate use

Computer memory stores information, such as data and programs, for immediate use in the computer. The term memory is often synonymous with the terms RAM,main memory or primary storage. Archaic synonyms for main memory include core and store.

<span class="mw-page-title-main">DDR SDRAM</span> Type of computer memory

Double Data Rate Synchronous Dynamic Random-Access Memory is a double data rate (DDR) synchronous dynamic random-access memory (SDRAM) class of memory integrated circuits used in computers. DDR SDRAM, also retroactively called DDR1 SDRAM, has been superseded by DDR2 SDRAM, DDR3 SDRAM, DDR4 SDRAM and DDR5 SDRAM. None of its successors are forward or backward compatible with DDR1 SDRAM, meaning DDR2, DDR3, DDR4 and DDR5 memory modules will not work on DDR1-equipped motherboards, and vice versa.

<span class="mw-page-title-main">Error detection and correction</span> Techniques that enable reliable delivery of digital data over unreliable communication channels

In information theory and coding theory with applications in computer science and telecommunication, error detection and correction (EDAC) or error control are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise, and thus errors may be introduced during transmission from the source to a receiver. Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data in many cases.

<span class="mw-page-title-main">Static random-access memory</span> Type of computer memory

Static random-access memory is a type of random-access memory (RAM) that uses latching circuitry (flip-flop) to store each bit. SRAM is volatile memory; data is lost when power is removed.

<span class="mw-page-title-main">Dynamic random-access memory</span> Type of computer memory

Dynamic random-access memory is a type of random-access semiconductor memory that stores each bit of data in a memory cell, usually consisting of a tiny capacitor and a transistor, both typically based on metal–oxide–semiconductor (MOS) technology. While most DRAM memory cell designs use a capacitor and transistor, some only use two transistors. In the designs where a capacitor is used, the capacitor can either be charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1. The electric charge on the capacitors gradually leaks away; without intervention the data on the capacitor would soon be lost. To prevent this, DRAM requires an external memory refresh circuit which periodically rewrites the data in the capacitors, restoring them to their original charge. This refresh process is the defining characteristic of dynamic random-access memory, in contrast to static random-access memory (SRAM) which does not require data to be refreshed. Unlike flash memory, DRAM is volatile memory, since it loses its data quickly when power is removed. However, DRAM does exhibit limited data remanence.

<span class="mw-page-title-main">DIMM</span> Computer memory module

A DIMM, or Dual In-Line Memory Module, is a type of computer memory module used in desktop, laptop, and server computers. It is a circuit board that contains memory chips and connects to the computer's motherboard. A DIMM is often called a "RAM stick" due to its shape and size. A DIMM comprises a series of dynamic random-access memory integrated circuits that are mounted to its circuit board. DIMMs are the predominant method for adding memory into a computer system. The vast majority of DIMMs are standardized through JEDEC standards, although there are proprietary DIMMs. DIMMs come in a variety of speeds and sizes, but generally are one of two lengths - PC which are 133.35 mm (5.25 in) and laptop (SO-DIMM) which are about half the size at 67.60 mm (2.66 in).

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

<span class="mw-page-title-main">Data corruption</span> Errors in computer data that introduce unintended changes to the original data

Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer, transmission, and storage systems use a number of measures to provide end-to-end data integrity, or lack of errors.

In electronics and computing, a soft error is a type of error where a signal or datum is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but is not assumed to imply such a mistake or breakage. After observing a soft error, there is no implication that the system is any less reliable than before. One cause of soft errors is single event upsets from cosmic rays.

<span class="mw-page-title-main">1T-SRAM</span> Pseudo-static random-access memory technology introduced by MoSys Inc.

1T-SRAM is a pseudo-static random-access memory (PSRAM) technology introduced by MoSys, Inc. in September 1998, which offers a high-density alternative to traditional static random-access memory (SRAM) in embedded memory applications. Mosys uses a single-transistor storage cell like dynamic random-access memory (DRAM), but surrounds the bit cell with control circuitry that makes the memory functionally equivalent to SRAM. 1T-SRAM has a standard single-cycle SRAM interface and appears to the surrounding logic just as an SRAM would.

RAM parity checking is the storing of a redundant parity bit representing the parity of a small amount of computer data stored in random-access memory, and the subsequent comparison of the stored and the computed parity to detect whether a data error has occurred.

Memory refresh is the process, for the purpose of preserving the information, of periodically reading information from an area of computer memory and immediately rewriting the read information to the same area without modification. Memory refresh is a background maintenance process required during the operation of semiconductor dynamic random-access memory (DRAM), the most widely used type of computer memory, and in fact is the defining characteristic of this class of memory.

Data scrubbing is an error correction technique that uses a background task to periodically inspect main memory or storage for errors, then corrects detected errors using redundant data in the form of different checksums or copies of data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors.

<span class="mw-page-title-main">ECC memory</span> Self-correcting computer data storage

Error correction code memory is a type of computer data storage that uses an error correction code (ECC) to detect and correct n-bit data corruption which occurs in memory.

A memory controller is a digital circuit that manages the flow of data going to and from a computer's main memory. A memory controller can be a separate chip or integrated into another chip, such as being placed on the same die or as an integral part of a microprocessor; in the latter case, it is usually called an integrated memory controller (IMC). A memory controller is sometimes also called a memory chip controller (MCC) or a memory controller unit (MCU).

<span class="mw-page-title-main">Memory module</span>

In computing, a memory module or RAM stick is a printed circuit board on which memory integrated circuits are mounted. Memory modules permit easy installation and replacement in electronic systems, especially computers such as personal computers, workstations, and servers. The first memory modules were proprietary designs that were specific to a model of computer from a specific manufacturer. Later, memory modules were standardized by organizations such as JEDEC and could be used in any system designed to use them.

<span class="mw-page-title-main">Random-access memory</span> Form of computer data storage

Random-access memory is a form of electronic computer memory that can be read and changed in any order, typically used to store working data and machine code. A random-access memory device allows data items to be read or written in almost the same amount of time irrespective of the physical location of data inside the memory, in contrast with other direct-access data storage media, where the time required to read and write data items varies significantly depending on their physical locations on the recording medium, due to mechanical limitations such as media rotation speeds and arm movement.

<span class="mw-page-title-main">DDR5 SDRAM</span> Fifth generation of double-data-rate synchronous dynamic random-access memory

Double Data Rate 5 Synchronous Dynamic Random-Access Memory is a type of synchronous dynamic random-access memory. Compared to its predecessor DDR4 SDRAM, DDR5 was planned to reduce power consumption, while doubling bandwidth. The standard, originally targeted for 2018, was released on July 14, 2020.

Row hammer is a security exploit that takes advantage of an unintended and undesirable side effect in dynamic random-access memory (DRAM) in which memory cells interact electrically between themselves by leaking their charges, possibly changing the contents of nearby memory rows that were not addressed in the original memory access. This circumvention of the isolation between DRAM memory cells results from the high cell density in modern DRAM, and can be triggered by specially crafted memory access patterns that rapidly activate the same memory rows numerous times.

References

  1. Ronald K. Burek. "The NEAR Solid-State Data Recorders". Johns Hopkins APL Technical Digest. 1998.
  2. DRAM Errors in the Wild: A Large-Scale Field Study
  3. "Supermicro X9SRA motherboard manual" (PDF). Supermicro. March 5, 2014. pp. 4–10. Retrieved February 22, 2015.