Predictive failure analysis

Predictive Failure Analysis (PFA) refers to methods intended to predict the imminent failure of systems or components (software or hardware), and potentially to enable mechanisms that avoid or counteract failures, or to recommend maintenance of systems before failure occurs.

For example, a computer may analyze trends in corrected errors to predict future failures of hardware or memory components and proactively enable mechanisms to avoid them. Predictive Failure Analysis was originally used as a term for a proprietary IBM technology for monitoring the likelihood of hard disk drives failing, although the term is now used generically for a variety of technologies for judging the imminent failure of CPUs, memory and I/O devices. [1] See also first failure data capture.

Disks

IBM introduced the term PFA and the underlying technology in 1992 with its 0662-S1x drive, a 1052 MB Fast-Wide SCSI-2 disk that operated at 5400 rpm.

The technology relies on measuring several key (mainly mechanical) parameters of the drive unit, such as the flying height of the heads. The drive firmware compares the measured parameters against predefined thresholds and evaluates the health status of the drive. If the drive appears likely to fail soon, the system sends a notification to the disk controller.
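The threshold idea can be illustrated with a short, purely hypothetical sketch: the parameter names and limit values below are invented for the example, since the actual attributes and thresholds are defined internally by the drive firmware.

```python
# Hypothetical illustration of PFA-style threshold checking.
# Parameter names and limits are invented for the example; real drive
# firmware defines its own attributes and thresholds internally.

THRESHOLDS = {
    "head_flying_height_nm": 15.0,   # warn if heads fly lower than this
    "spin_up_time_ms": 9000.0,       # warn if spin-up takes longer than this
    "reallocated_sectors": 50,       # warn if too many sectors were remapped
}

def evaluate_health(measurements: dict) -> list[str]:
    """Return a list of warnings for parameters that crossed their threshold."""
    warnings = []
    for name, limit in THRESHOLDS.items():
        value = measurements.get(name)
        if value is None:
            continue
        # For flying height, lower is worse; for the other parameters, higher is worse.
        crossed = value < limit if name == "head_flying_height_nm" else value > limit
        if crossed:
            warnings.append(f"{name}={value} crossed threshold {limit}")
    return warnings

if __name__ == "__main__":
    sample = {"head_flying_height_nm": 12.0, "spin_up_time_ms": 7000.0,
              "reallocated_sectors": 73}
    for w in evaluate_health(sample):
        print("PFA warning:", w)   # in a real drive this would notify the controller
```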

The major drawbacks of the technology included:

The technology merged with IntelliSafe to form the Self-Monitoring, Analysis, and Reporting Technology (SMART).

Processor and memory

High counts of intermittent RAM errors corrected by ECC can be predictive of future DIMM failures, [2] so automatic offlining of memory pages and CPU caches can be used to avoid future errors. [3] For example, under the Linux operating system the mcelog daemon will automatically remove from use memory pages showing excessive corrections, and will remove from use processor cores whose caches show excessive correctable errors. [4]
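As an illustration of monitoring corrected-error counts, the sketch below reads the counters exposed by the Linux EDAC subsystem under /sys/devices/system/edac/mc (assuming EDAC drivers are loaded) and flags memory controllers exceeding an arbitrary example threshold. It is a monitoring sketch only, not the retirement policy that mcelog actually implements.

```python
# Minimal sketch: read corrected/uncorrected error counters from the Linux
# EDAC sysfs interface and flag memory controllers with high counts.
# Assumes EDAC drivers are loaded; the threshold is an arbitrary example,
# not the policy mcelog uses.
import glob
import os

CE_THRESHOLD = 100  # example value; real policies consider the error rate over time

def read_count(path: str) -> int:
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0

def check_memory_controllers():
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        ce = read_count(os.path.join(mc, "ce_count"))   # corrected errors
        ue = read_count(os.path.join(mc, "ue_count"))   # uncorrected errors
        status = "suspect" if ce > CE_THRESHOLD or ue > 0 else "ok"
        print(f"{os.path.basename(mc)}: corrected={ce} uncorrected={ue} -> {status}")

if __name__ == "__main__":
    check_memory_controllers()
```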

Optical media

On optical media (CD, DVD and Blu-ray), failures caused by degradation of the media can be predicted, and media of low manufacturing quality can be detected before data loss occurs, by measuring the rate of correctable data errors with software such as QpxTool or Nero DiscSpeed. However, not all vendors and models of optical drives support error scanning. [5]
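A rough sketch of how such scan results might be interpreted is shown below. The sample counts are invented, and the limits are only loosely based on common DVD quality rules of thumb for PIE and PIF counts, so they should not be read as authoritative criteria.

```python
# Hypothetical interpretation of an optical-media error scan.
# The sample counts are invented; tools such as QpxTool report per-interval
# correctable/uncorrectable error counts (C1/C2 for CDs, PIE/PIF for DVDs).

def classify_dvd_scan(pie_samples: list[int], pif_samples: list[int]) -> str:
    """Classify a DVD scan from parity-inner-error (PIE) and
    parity-inner-failure (PIF) counts per measurement interval."""
    max_pie = max(pie_samples, default=0)
    max_pif = max(pif_samples, default=0)
    # Example limits only; real quality criteria vary by drive and standard.
    if max_pif > 4 or max_pie > 280:
        return "degrading - consider copying the data to fresh media"
    if max_pie > 100:
        return "elevated error rate - re-scan periodically"
    return "healthy"

if __name__ == "__main__":
    pie = [12, 30, 55, 140, 90]
    pif = [0, 1, 0, 2, 0]
    print(classify_dvd_scan(pie, pif))
```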

References

  1. Intel Corp (2011). "Intel Xeon Processor E7 Family: Supporting Next Generation RAS Servers" (white paper). Retrieved 9 May 2012.
  2. Bianca Schroeder; Eduardo Pinheiro; Wolf-Dietrich Weber (2009). "DRAM Errors in the Wild: A Large-Scale Field Study". Proceedings of SIGMETRICS 2009.
  3. Tang; Carruthers; Totari; Shapiro (2006). "Assessment of the Effect of Memory Page Retirement on Systems RAS against Hardware Faults". Proceedings of the 2006 International Conference on Dependable Systems and Networks.
  4. "mcelog: Memory Error Handling in User Space" (PDF). Linux Kongress 2010.
  5. List of devices supported by the disc quality scanning software QPxTool.
