Redundant array of independent memory

A redundant array of independent memory (RAIM) is a design feature found in the main random-access memory of certain computers. [1] RAIM uses additional memory modules and striping algorithms to protect against the failure of any particular module and to keep the memory system operating continuously. RAIM is similar in concept to a redundant array of independent disks (RAID), which protects against the failure of a disk drive, but in the case of memory it tolerates several DRAM device failures (chipkills) as well as the failure of an entire memory channel. RAIM is much more robust than parity checking and ECC memory technologies, which cannot protect against many varieties of memory failures.
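The RAID-like idea of striping data over several channels with one redundant channel can be illustrated with a minimal sketch. It assumes simple XOR parity and a hypothetical layout of four data channels plus one parity channel; the function names are invented for illustration, and the sketch does not represent the error-correcting code IBM actually uses in RAIM.

```python
from functools import reduce


def xor_columns(channels):
    """XOR the channels together byte by byte."""
    return bytearray(reduce(lambda a, b: a ^ b, column) for column in zip(*channels))


def stripe_with_parity(data: bytes, data_channels: int = 4) -> list:
    """Spread data round-robin over data_channels and append an XOR parity channel."""
    # Pad with zero bytes so every channel holds the same number of bytes.
    padded = data + bytes(-len(data) % data_channels)
    channels = [bytearray(padded[i::data_channels]) for i in range(data_channels)]
    return channels + [xor_columns(channels)]


def rebuild_channel(channels: list, failed: int) -> bytearray:
    """Recompute one lost channel from all surviving channels (data and parity)."""
    survivors = [ch for i, ch in enumerate(channels) if i != failed]
    return xor_columns(survivors)


def read_back(channels: list, length: int, data_channels: int = 4) -> bytes:
    """Reassemble the original byte order from the data channels."""
    out = bytearray()
    for column in zip(*channels[:data_channels]):
        out.extend(column)
    return bytes(out[:length])


payload = b"RAIM protects memory channels"
channels = stripe_with_parity(payload)

# Simulate the loss of channel 2, then rebuild it from the other four.
channels[2] = None
channels[2] = rebuild_channel(channels, failed=2)
recovered = read_back(channels, len(payload))
```

With single XOR parity, any one channel can be lost and rebuilt, which mirrors the channel-level protection described above; correcting several failed DRAM devices at once requires stronger codes than this sketch uses.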

On July 22, 2010, IBM introduced the first high-end computer server featuring RAIM, the zEnterprise 196. Each z196 machine contains up to 3 TB (usable) of RAIM-protected main memory. In 2011, IBM introduced the business-class model z114, which also supports RAIM. The formal announcement letter offered some additional information regarding the implementation:

... IBM's most robust error correction to date can be found in the memory subsystem. A new redundant array of independent memory (RAIM) technology is being introduced to provide protection at the dynamic random access memory (DRAM), dual inline memory module (DIMM), and memory channel level. Three full DRAM failures per rank can be corrected. DIMM level failures, including components such as the controller application specific integrated circuit (ASIC), the power regulators, the clocks, and the board, can be corrected. Memory channel failures such as signal lines, control lines, and drivers/receivers on the MCM can be corrected. Upstream and downstream data signals can be spared using two spare wires on both the upstream and downstream paths. One of these signals can be used to spare a clock signal line (one upstream and one downstream). Together these improvements are designed to deliver System z's most resilient memory subsystem to date. [2]

Related Research Articles

RAID is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both. This was in contrast to the previous concept of highly reliable mainframe disk drives referred to as "single large expensive disk" (SLED).

Dynamic random-access memory: random-access memory that stores each bit of data in a separate capacitor within an integrated circuit

Dynamic random-access memory (DRAM) is a type of random access semiconductor memory that stores each bit of data in a memory cell consisting of a tiny capacitor and a transistor, both typically based on metal-oxide-semiconductor (MOS) technology. The capacitor can either be charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1. The electric charge on the capacitors slowly leaks off, so without intervention the data on the chip would soon be lost. To prevent this, DRAM requires an external memory refresh circuit which periodically rewrites the data in the capacitors, restoring them to their original charge. This refresh process is the defining characteristic of dynamic random-access memory, in contrast to static random-access memory (SRAM) which does not require data to be refreshed. Unlike flash memory, DRAM is volatile memory, since it loses its data quickly when power is removed. However, DRAM does exhibit limited data remanence.

DIMM: computer memory module that has separate electrical contacts on each side of the module and a 64-bit data path

A DIMM or dual in-line memory module comprises a series of dynamic random-access memory integrated circuits. These modules are mounted on a printed circuit board and designed for use in personal computers, workstations and servers. DIMMs began to replace SIMMs as the predominant type of memory module as Intel P5-based Pentium processors began to gain market share.

Rambus DRAM (RDRAM), and its successors Concurrent Rambus DRAM (CRDRAM) and Direct Rambus DRAM (DRDRAM), are types of synchronous dynamic random-access memory (SDRAM) developed by Rambus from the 1990s through the early 2000s. The third generation of Rambus DRAM, DRDRAM, was replaced by XDR DRAM. Rambus DRAM was developed for high-bandwidth applications and was positioned by Rambus as a replacement for various types of contemporary memory, such as SDRAM.

Serial Storage Architecture (SSA) was a serial transport protocol used to attach disk drives to server computers.

HP Integrity is a series of server computers produced by Hewlett Packard Enterprise since 2003, based on the Itanium processor. The Integrity brand name was inherited by HP from Tandem Computers via Compaq.

In the fields of digital electronics and computer hardware, multi-channel memory architecture is a technology that increases the data transfer rate between the DRAM and the memory controller by adding more channels of communication between them. Theoretically, this multiplies the data rate by exactly the number of channels present. Dual-channel memory employs two channels. The technique goes back as far as the 1960s, having been used in the IBM System/360 Model 91 and in the CDC 6600.
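The theoretical scaling can be shown with a small worked calculation. The sketch below assumes a 64-bit data path per channel (as given for a DIMM above) and uses DDR3-1600, at 1600 million transfers per second, purely as an illustrative figure; the function name is hypothetical.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s: int, bus_width_bits: int = 64, channels: int = 1) -> int:
    """Theoretical peak bandwidth in MB/s: transfers/s x bytes per transfer x channels."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * bytes_per_transfer * channels


single = peak_bandwidth_mb_s(1600, channels=1)  # one 64-bit channel
dual = peak_bandwidth_mb_s(1600, channels=2)    # two channels, twice the theoretical peak
```

These are upper bounds only; the bandwidth achieved in practice depends on access patterns and controller behavior.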

Registered memory: computer memory module containing a hardware buffer between the DRAM chips and the system's memory controller

Registered memory modules have a register between the DRAM modules and the system's memory controller. They place less electrical load on the memory controller and allow single systems to remain stable with more memory modules than they would have otherwise. When compared with registered memory, conventional memory is usually referred to as unbuffered memory or unregistered memory. When manufactured as a dual in-line memory module (DIMM), a registered memory module is called an RDIMM, while unregistered memory is called UDIMM or simply DIMM.

Reliability, availability and serviceability (RAS) is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

IBM Z: family name used by IBM for its non-POWER mainframe computers from the Z900 on

IBM Z is a family name used by IBM for all of its z/Architecture mainframe computers from the Z900 on. In July 2017, with another generation of products, the official family was changed to IBM Z from IBM z Systems; the IBM Z family now includes the newest model the IBM z15, as well as the z14 and the z13, the IBM zEnterprise models, the IBM System z10 models, the IBM System z9 models and IBM eServer zSeries models.

Fully Buffered DIMM: memory technology

Fully Buffered DIMM is a memory technology that can be used to increase reliability and density of memory systems. Conventionally, data lines from the memory controller have to be connected to data lines in every DRAM module, i.e. via multidrop buses. As the memory width increases together with the access speed, the signal degrades at the interface between the bus and the device. This limits the speed and memory density, so FB-DIMMs take a different approach to solve the problem.

Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code, that can correct multiple bits with less overhead.
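The simple scattering scheme described above can be sketched as follows. The example assumes a Hamming(7,4) code and seven hypothetical memory chips, each storing one bit position of every codeword; all function names are invented for illustration, and production Chipkill implementations use wider words and stronger codes.

```python
def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_decode(bits) -> int:
    """Correct at most one flipped bit, then return the 4 data bits as an integer."""
    bits = list(bits)
    syndrome = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= position
    if syndrome:  # a nonzero syndrome is the 1-based position of the bad bit
        bits[syndrome - 1] ^= 1
    data = [bits[2], bits[4], bits[5], bits[6]]
    return sum(bit << i for i, bit in enumerate(data))


def write_words(nibbles) -> list:
    """Scatter each codeword so that chip k holds bit k of every word."""
    chips = [[] for _ in range(7)]
    for nibble in nibbles:
        for chip, bit in zip(chips, hamming74_encode(nibble)):
            chip.append(bit)
    return chips


def read_words(chips) -> list:
    """Gather one bit from each chip per word and decode it."""
    return [hamming74_decode(word) for word in zip(*chips)]


stored = list(range(16))
chips = write_words(stored)
chips[3] = [1] * len(stored)  # simulate chip 3 failing with all outputs stuck at 1
recovered = read_words(chips)
```

Because each chip contributes only one bit to any codeword, a whole-chip failure looks like at most a single-bit error per word, which the Hamming code corrects.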

Memory module: discrete printed circuit board on which memory chips are mounted

In computing, a memory module is a printed circuit board on which memory integrated circuits are mounted. Memory modules permit easy installation and replacement in electronic systems, especially computers such as personal computers, workstations, and servers. The first memory modules were proprietary designs that were specific to a model of computer from a specific manufacturer. Later, memory modules were standardized by organizations such as JEDEC and could be used in any system designed to use them.

Predictive Failure Analysis (PFA) refers to methods intended to predict imminent failure of systems or components, and potentially enable mechanisms to avoid or counteract failure issues, or recommend maintenance of systems prior to failure.

A memory rank is a set of DRAM chips connected to the same chip select, which are therefore accessed simultaneously. In practice all DRAM chips share all of the other command and control signals, and only the chip select pins for each rank are separate.

Universal Storage Platform (USP) was the brand name for a Hitachi Data Systems line of computer data storage disk arrays circa 2004 to 2010.

This is a glossary of terms relating to computer hardware – physical computer hardware, architectural issues, and peripherals.

IBM zEnterprise System: a line of IBM mainframe computer systems

IBM zEnterprise System is an IBM mainframe designed to offer both mainframe and distributed server technologies in an integrated system. The zEnterprise System consists of three components. First is a System z server – a choice of the newest enterprise class server, the IBM zEnterprise EC12 that was announced August 28, 2012, the smaller business class server the IBM zEnterprise 114 (z114) announced July 2011, or the older enterprise-class server the IBM zEnterprise 196 (z196) that was introduced July 2010. Second is the IBM zEnterprise BladeCenter Extension (zBX), the infrastructure designed to provide logical integration and host IBM WebSphere DataPower Integrated Appliance XI50 for zEnterprise or general purpose x86 or Power ISA blades. Last is the management layer, IBM zEnterprise Unified Resource Manager (zManager), which provides a single management view of zEnterprise resources.

HyperCloud Memory (HCDIMM) is a DDR3 SDRAM dual in-line memory module (DIMM) used in server applications requiring a great deal of memory. It was initially launched in 2009 at the International Supercomputing Conference by Netlist Inc., a company based in Irvine, California. It was never a JEDEC standard, and the main server vendors supporting it were IBM and Hewlett Packard Enterprise.

IBM System/390: line of mainframe computers

The IBM System/390 was the third generation of the System/360 instruction set architecture. The first ESA/390 computers were the Enterprise System/9000 (ES/9000) family, which was introduced in 1990. These were followed by the CMOS System/390 mainframe family in the mid-1990s. These systems followed the IBM 3090, with over a decade of follow-ons. The ESA/390 was succeeded by the 64-bit z/Architecture in 2000.

References

  1. Meaney, P. J. (Jan–Feb 2012). "IBM zEnterprise redundant array of independent memory subsystem". IBM Journal of Research and Development. 56: 4:1–4:11. doi:10.1147/jrd.2011.2177106.
  2. "Formal Announcement Letter for zEnterprise". IBM. 2010-07-22.