Reliability, availability and serviceability


Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers. [1] [2]


Computers designed with higher levels of RAS have many features that protect data integrity and help them stay available for long periods of time without failure. [3] Data integrity and uptime are particular selling points for mainframes and fault-tolerant systems.

Definitions

While RAS originated as a hardware-oriented [citation needed] term, systems thinking has extended the concept of reliability-availability-serviceability to systems in general, including software: [4]

Note the distinction between reliability and availability: reliability measures the ability of a system to function correctly, including avoiding data corruption, whereas availability measures how often the system is available for use, even though it may not be functioning correctly. For example, a server may run forever and so have ideal availability, but may be unreliable, with frequent data corruption. [6]
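The distinction can be made concrete with a small sketch. The record type `ServiceLog` and the two metrics below are illustrative assumptions, not definitions taken from the cited sources: availability is computed as the fraction of time the server was up, and a rough reliability proxy as the fraction of requests answered without corruption.

```python
from dataclasses import dataclass


@dataclass
class ServiceLog:
    """Hypothetical one-year summary of a server's behaviour."""
    uptime_hours: float
    downtime_hours: float
    requests_served: int
    requests_corrupted: int


def availability(log: ServiceLog) -> float:
    """Fraction of total time the system was available for use."""
    total = log.uptime_hours + log.downtime_hours
    return log.uptime_hours / total


def correctness_rate(log: ServiceLog) -> float:
    """Fraction of served requests that were not corrupted (a reliability proxy)."""
    return 1 - log.requests_corrupted / log.requests_served


# A server that never goes down but corrupts 5% of what it serves.
always_up = ServiceLog(
    uptime_hours=8760.0,
    downtime_hours=0.0,
    requests_served=1_000_000,
    requests_corrupted=50_000,
)

print(availability(always_up))      # ideal availability
print(correctness_rate(always_up))  # yet clearly imperfect reliability
```

Here the server scores perfectly on availability while one request in twenty is corrupted, which is the situation the example in the text describes.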

Failure types

Physical faults can be temporary or permanent:

Failure responses

Transient and intermittent faults can typically be handled by detection and correction, e.g., by error-correcting codes (ECC) or instruction replay. Permanent faults lead to uncorrectable errors, which can be handled either by switching to duplicate hardware (e.g., processor sparing) or by passing the uncorrectable error to higher-level recovery mechanisms. A successfully corrected intermittent fault can also be reported to the operating system (OS) to provide information for predictive failure analysis.
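As an illustration of detection and correction, the following minimal sketch implements a Hamming(7,4) code, a textbook single-error-correcting code. It is not the ECC scheme of any particular machine, and the function names are hypothetical. The decoder returns the position of the bit it repaired, which is the kind of information that could be logged for predictive failure analysis.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct up to one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position of the corrected bit, or 0 if no error was detected.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)

# Simulate a transient fault: one bit flips while the word is stored.
faulty = list(stored)
faulty[4] ^= 1

recovered, position = hamming74_decode(faulty)
print(recovered == word, position)
```

A code like this corrects only a single flipped bit per word; two simultaneous flips would be miscorrected, which is why practical memory ECC typically adds a further parity bit to detect double errors.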

Hardware features

Example hardware features for improving RAS include the following, listed by subsystem:

Fault-tolerant designs extended the idea by making RAS the defining feature of their computers for applications like stock market exchanges or air traffic control, where system crashes would be catastrophic. Fault-tolerant computers (e.g., see Tandem Computers and Stratus Technologies), which tend to have duplicate components running in lock-step for reliability, have become less popular due to their high cost. High-availability systems, using distributed computing techniques like computer clusters, are often used as cheaper alternatives. [citation needed]

See also

Related Research Articles

MVS: Operating system for IBM mainframes

Multiple Virtual Storage, more commonly called MVS, is the most commonly used operating system on IBM's System/370, System/390 and IBM Z mainframe computers. IBM developed MVS, along with OS/VS1 and SVS, as a successor to OS/360. It is unrelated to IBM's other mainframe operating system lines, e.g., VSE, VM, TPF.

Mainframe computer: Large computer

A mainframe computer, informally called a mainframe or big iron, is a computer used primarily by large organizations for critical applications like bulk data processing for tasks such as censuses, industry and consumer statistics, enterprise resource planning, and large-scale transaction processing. A mainframe computer is large but not as large as a supercomputer and has more processing power than some other classes of computers, such as minicomputers, servers, workstations, and personal computers. Most large-scale computer-system architectures were established in the 1960s, but they continue to evolve. Mainframe computers are often used as servers.

RAID is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both. This is in contrast to the previous concept of highly reliable mainframe disk drives referred to as "single large expensive disk" (SLED).

Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, 911 systems, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's acquisition of Compaq and the split of Hewlett-Packard into HP Inc. and Hewlett Packard Enterprise.

Crash (computing): When a computer program stops functioning properly and self-terminates

In computing, a crash, or system crash, occurs when a computer program such as a software application or an operating system stops functioning properly and exits. On some operating systems or individual applications, a crash reporting service will report the crash and any details relating to it, usually to the developer(s) of the application. If the program is a critical part of the operating system, the entire system may crash or hang, often resulting in a kernel panic or fatal system error.

Memory protection is a way to control memory access rights on a computer, and is a part of most modern instruction set architectures and operating systems. The main purpose of memory protection is to prevent a process from accessing memory that has not been allocated to it. This prevents a bug or malware within a process from affecting other processes, or the operating system itself. Protection may encompass all accesses to a specified area of memory, write accesses, or attempts to execute the contents of the area. An attempt to access unauthorized memory results in a hardware fault, e.g., a segmentation fault, storage violation exception, generally causing abnormal termination of the offending process. Memory protection for computer security includes additional techniques such as address space layout randomization and executable-space protection.

NonStop is a series of server computers introduced to market in 1976 by Tandem Computers Inc., beginning with the NonStop product line. It was followed by the Tandem Integrity NonStop line of lock-step fault-tolerant computers, now defunct. The original NonStop product line has been offered by Hewlett Packard Enterprise since Hewlett-Packard Company's split in 2015. Because NonStop systems are based on an integrated hardware/software stack, Tandem and later HPE also developed the NonStop OS operating system for them.

Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel. The redundancy (duplication) allows error detection and error correction: with at least two systems, the outputs of lockstep operations can be compared to detect a fault, and with at least three systems the error can be automatically corrected by majority vote. The term "lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical.
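The two cases can be sketched as follows. This is a simplified software model with hypothetical function names, not a description of how any specific lockstep hardware compares results: two replicas allow only detection of a disagreement, while three or more allow a majority vote to mask a single faulty result.

```python
from collections import Counter


def outputs_agree(a, b):
    """Two replicas: a mismatch reveals a fault but not which replica is wrong."""
    return a == b


def majority_vote(outputs):
    """Three or more replicas: return the value produced by a strict majority.

    Raises RuntimeError when no value has a strict majority, in which case
    the error is uncorrectable at this level.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


# Replica 3 suffers a fault and produces a wrong result.
print(outputs_agree(42, 7))        # fault detected, cannot be corrected
print(majority_vote([42, 42, 7]))  # fault masked by the two good replicas
```

The vote assumes that replicas fail independently; a design error shared by all replicas would produce the same wrong answer everywhere and pass the vote unnoticed.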

Redundancy (engineering): Duplication of critical components to increase reliability of a system

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

Fault tolerance is the ability of a system to maintain proper operation in the event of failures or faults in one or more of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can lead to total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

IBM Z: Family name used by IBM for its z/Architecture mainframe computers

IBM Z is a family name used by IBM for all of its z/Architecture mainframe computers. In July 2017, with another generation of products, the official family name was changed to IBM Z from IBM z Systems; the IBM Z family now includes the newest model, the IBM z16, as well as the z15, the z14, and the z13, the IBM zEnterprise models, the IBM System z10 models, the IBM System z9 models and IBM eServer zSeries models.

A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.

ECC memory: Self-correcting computer data storage

Error correction code memory is a type of computer data storage that uses an error correction code (ECC) to detect and correct n-bit data corruption which occurs in memory.

Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code, that can correct multiple bits with less overhead.

Predictive Failure Analysis (PFA) refers to methods intended to predict imminent failure of systems or components, and potentially enable mechanisms to avoid or counteract failure issues, or recommend maintenance of systems prior to failure.

In software engineering and hardware engineering, serviceability is one of the -ilities or aspects. It refers to the ability of technical support personnel to install, configure, and monitor computer products, identify exceptions or faults, debug or isolate faults to root cause analysis, and provide hardware or software maintenance in pursuit of solving a problem and restoring the product into service. Incorporating features that facilitate serviceability typically results in more efficient product maintenance, reduces operational costs, and helps maintain business continuity.

Computer cluster: Set of computers configured in a distributed computing system

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

High availability software is software used to ensure that systems are running and available most of the time. High availability is a high percentage of time that the system is functioning. It can be formally defined as (1 - (downtime / total time)) * 100%. Although the minimum required availability varies by task, systems typically attempt to achieve 99.999% (5-nines) availability. This characteristic is weaker than fault tolerance, which typically seeks to provide 100% availability, albeit with significant price and performance penalties.
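To give a sense of what such percentages mean in practice, the sketch below converts between downtime and availability. It assumes availability is measured as the fraction of total time the system is up and uses a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_percent(downtime_minutes, total_minutes):
    """Availability as the percentage of total time the system was up."""
    return (1 - downtime_minutes / total_minutes) * 100


def allowed_downtime_minutes_per_year(target_percent):
    """Downtime budget per year implied by an availability target."""
    return (1 - target_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes_per_year(target)
    print(f"{target}% -> {budget:.2f} minutes of downtime per year")
```

Under these assumptions, five-nines availability leaves a budget of a little over five minutes of downtime per year, whereas 99% allows more than three and a half days.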

References

  1. Siewiorek, Daniel P.; Swarz, Robert S. (1998). Reliable computer systems: design and evaluation. Taylor & Francis. p. 508. ISBN 9781568810928. "The acronym RAS (reliability, accessibility and serviceability) came into widespread acceptance at IBM as the replacement for the subset notion of recovery management."
  2. Data Processing Division, International Business Machines Corp. (1970). "Data processor, Issues 13-17". "The dependability [...] experienced by other System/370 users is the result of a strategy based on RAS (Reliability-Availability-Serviceability)"
  3. Siewert, Sam (Mar 2005). "Big iron lessons, Part 2: Reliability and availability: What's the difference?" (PDF).
  4. For example: Laros III, James H.; et al. (4 September 2012). Energy-Efficient High Performance Computing: Measurement and Tuning. SpringerBriefs in Computer Science. Springer Science & Business Media (published 2012). p. 8. ISBN 9781447144922. Retrieved 2014-07-08. "Historically, Reliability Availability and Serviceability (RAS) systems were commonly provided by vendors on mainframe class systems. [...] The RAS system shall be a systematic union of software and hardware for the purpose of managing and monitoring all hardware and software components of the system to their individual potential."
  5. E.J. McCluskey & S. Mitra (2004). "Fault Tolerance" in Computer Science Handbook 2ed. ed. A.B. Tucker. CRC Press.
  6. Spencer, Richard H.; Floyd, Raymond E. (11 July 2011). Perspectives on Engineering. Bloomington, Indiana: AuthorHouse (published 2011). p. 33. ISBN 9781463410919. Retrieved 2014-05-05. "[...] a system server may have excellent availability (runs forever), but continues to have frequent data corruption (not very reliable)."
  7. Daniel Lipetz & Eric Schwarz (2011). "Self Checking in Current Floating-Point Units. Proceedings of 2011 20th IEEE Symposium on Computer Arithmetic" (PDF). Archived from the original (PDF) on 2012-01-24. Retrieved 2012-05-06.
  8. L. Spainhower & T. A. Gregg (September 1999). "IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective. IBM Journal of Research and Development. Volume 43 Issue 5" (PDF). CiteSeerX 10.1.1.85.5994.
  9. "Intel Instruction Replay Technology Detects and Corrects Errors". Retrieved 2012-12-07.
  10. HP. "Memory technology evolution: an overview of system memory technologies Technology brief, 9th edition (page 8)" (PDF). Archived from the original (PDF) on 2011-07-24.
  11. Intel Corp. (2003). "PCI Express Provides Enterprise Reliability, Availability, and Serviceability".
  12. "Best Practices for Data Reliability with Oracle VM Server for SPARC" (PDF). Retrieved 2013-07-02.
  13. "IBM Power Redundancy considerations". Retrieved 2013-07-02.