Recovery-oriented computing

Recovery-oriented computing (sometimes abbreviated ROC) is a method developed jointly at Stanford University and the University of California, Berkeley for building reliable Internet services. Its proponents treat computer bugs as inevitable and focus on reducing their harmful effects. The National Science Foundation funds the project.

Characteristics

Several characteristics set recovery-oriented computing apart from other failure-handling techniques.

Isolation and redundancy

Isolation in these systems requires redundancy: should one part of the system fail, a redundant part must take its place. Isolation must hold for all types of failures, whether caused by software or by humans. One way to isolate parts of a system is to use a virtual machine monitor such as Xen. A virtual machine monitor lets many virtual machines run on a single physical machine; if a problem occurs in one virtual machine, it can be restarted without restarting the physical machine, or it can be stopped and another started in its place.
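As a minimal sketch of this idea (not part of the ROC project itself), the following Python supervisor keeps a small pool of redundant workers and, when one fails, starts a replacement so the failure stays isolated from the rest of the system; the worker function and pool size are illustrative assumptions.

```python
import multiprocessing as mp
import time

def worker(worker_id):
    """Hypothetical isolated unit of work; a real system might run a VM or service here."""
    while True:
        time.sleep(1)  # stand-in for serving requests

def supervise(pool_size=3, check_interval=1.0):
    """Keep `pool_size` redundant workers alive; replace any that fail."""
    workers = [mp.Process(target=worker, args=(i,)) for i in range(pool_size)]
    for p in workers:
        p.start()
    while True:
        for i, p in enumerate(workers):
            if not p.is_alive():                      # failure detected in one isolated part
                print(f"worker {i} failed; starting a replacement")
                workers[i] = mp.Process(target=worker, args=(i,))
                workers[i].start()                    # redundant part takes its place
        time.sleep(check_interval)

if __name__ == "__main__":
    supervise()
```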

System-wide undo support

The ability to undo changes across different programs and time frames is necessary in this type of system because human error causes roughly half of system failures.[1] Lacking undo support also limits testing on a production system, because it leaves no room for trial and error.

System-wide undo support should cover all aspects of the system, including hardware and software upgrades, configuration, and application management. There are limits to what can be undone; these limits are being explored, tested, and weighed against their tradeoffs.
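One common way to approximate such undo support is an operation log in which every change records how to reverse itself. The sketch below illustrates the pattern for configuration changes; it is an illustration under that assumption, not the Berkeley implementation.

```python
class UndoableSystem:
    """Minimal undo log: each applied change records an inverse action."""

    def __init__(self):
        self.config = {}
        self._undo_log = []  # stack of (description, inverse_fn)

    def set_option(self, key, value):
        old = self.config.get(key)
        existed = key in self.config

        def inverse():
            if existed:
                self.config[key] = old
            else:
                self.config.pop(key, None)

        self.config[key] = value
        self._undo_log.append((f"set {key}={value}", inverse))

    def undo(self, steps=1):
        """Roll back the most recent `steps` changes, latest first."""
        for _ in range(min(steps, len(self._undo_log))):
            description, inverse = self._undo_log.pop()
            inverse()

sys_state = UndoableSystem()
sys_state.set_option("max_connections", 100)
sys_state.set_option("max_connections", 500)   # a mistaken change...
sys_state.undo()                                # ...rolled back to 100
assert sys_state.config["max_connections"] == 100
```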

Integrated diagnostic support

Integrated diagnostic support is another characteristic a recovery-oriented system should have: the system should be able to identify the root cause of a failure and then either contain it so that it cannot affect other parts of the system or repair it. All system components or modules should be self-testing, able to detect when something is wrong with themselves. In addition to checking themselves, modules should be able to verify the behavior of the modules they depend on. The system must also track module, resource, and user-request dependencies throughout the system, which allows failures to be contained.
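A hedged sketch of the idea: each module exposes a self-test, the system records dependencies between modules, and a failure is traced along the dependency graph to the deepest failing module, which is reported as the suspected root cause. The module names and health checks below are invented for illustration.

```python
class Module:
    def __init__(self, name, depends_on=(), healthy=True):
        self.name = name
        self.depends_on = list(depends_on)  # tracked dependencies
        self.healthy = healthy

    def self_test(self):
        """Each module knows how to check itself."""
        return self.healthy

def find_root_cause(module):
    """Walk the dependency graph; the deepest failing module is the suspected root cause."""
    for dep in module.depends_on:
        cause = find_root_cause(dep)
        if cause is not None:
            return cause
    return None if module.self_test() else module

# Hypothetical service graph: frontend -> app server -> database
database = Module("database", healthy=False)        # injected fault
app_server = Module("app_server", depends_on=[database])
frontend = Module("frontend", depends_on=[app_server])

print(find_root_cause(frontend).name)  # -> "database"
```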

Online verification and recovery mechanisms

Recovery mechanisms are the means by which a system recovers from failures. They should be well designed: reliable, effective, and efficient. The system should proactively test and verify the behavior of its recovery mechanisms so that, when a real failure occurs, they are certain to do what they were designed to do and aid recovery. These verifications should be performed even on production equipment, since that is the equipment whose availability matters most. Two complementary test methods should both be used: directed tests, which are deliberately set up and executed, and random tests, which are injected without warning.
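The two kinds of verification can be sketched as follows: a directed test deliberately fails a chosen component and checks that recovery completes, while a random test injects the same kind of fault at an unpredictable moment. The FakeService class and its methods below are placeholders used to make the sketch runnable, not a real recovery API.

```python
import random
import time

class FakeService:
    """Stand-in system with trivially fast recovery, used only to make the tests runnable."""
    num_workers = 3
    expected_recovery_time = 0.1

    def __init__(self):
        self._alive = [True] * self.num_workers

    def kill_worker(self, i):
        self._alive[i] = False

    def is_serving(self):
        # pretend a supervisor already restarted anything that died
        self._alive = [True] * self.num_workers
        return all(self._alive)

def directed_recovery_test(system):
    """Deliberately fail a chosen component, then verify that recovery worked."""
    system.kill_worker(0)                       # planned, scripted fault
    time.sleep(system.expected_recovery_time)
    assert system.is_serving(), "directed test: recovery did not complete"

def random_recovery_test(system, window=2.0):
    """Inject a fault at an unpredictable moment, then verify recovery."""
    time.sleep(random.uniform(0, window))       # no warning
    system.kill_worker(random.randrange(system.num_workers))
    time.sleep(system.expected_recovery_time)
    assert system.is_serving(), "random test: recovery did not complete"

directed_recovery_test(FakeService())
random_recovery_test(FakeService())
```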

Modularity, measurability and restartability

Software-aging problems are best resolved by restarting the affected component, which requires both modularity and restartability. Components should be restarted before they fail, and they should be designed so that a restart can be triggered on demand or, better yet, performed automatically. Applications should likewise be designed for restartability.
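Proactive restart (often called software rejuvenation) can be as simple as recycling a worker after a fixed amount of work, before aging effects such as leaked state cause a failure. The request threshold below is an arbitrary illustrative value, not a recommendation.

```python
class RestartableWorker:
    """Worker that is recycled after a fixed number of requests, before aging can cause a failure."""

    MAX_REQUESTS = 10_000  # arbitrary rejuvenation threshold

    def __init__(self):
        self.restart()

    def restart(self):
        # discard all accumulated state (caches, counters, leaked buffers)
        self.handled = 0
        self.cache = {}

    def handle(self, request):
        if self.handled >= self.MAX_REQUESTS:
            self.restart()                      # proactive, automatic restart
        self.handled += 1
        return f"handled {request!r}"

worker = RestartableWorker()
for i in range(25_000):                         # the worker silently rejuvenates twice along the way
    worker.handle(i)
```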

Benchmarks

These systems should undergo frequent dependability and availability benchmarking to track progress and justify their use. The benchmarks should be reproducible and provide an impartial measure of system dependability, reliability, and availability.
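A dependability benchmark of this kind typically injects faults and reports a reproducible availability figure, for example availability = uptime / (uptime + downtime) over the run. The sketch below computes that figure from a recorded list of outages; the numbers are invented for illustration.

```python
def availability(total_seconds, outages):
    """Availability = uptime / total time, given (start, end) outage intervals in seconds."""
    downtime = sum(end - start for start, end in outages)
    return (total_seconds - downtime) / total_seconds

# Hypothetical 24-hour benchmark run with two injected failures and their measured recovery times.
run_length = 24 * 3600
observed_outages = [(3600, 3660), (50_000, 50_030)]   # 60 s and 30 s of downtime
print(f"availability: {availability(run_length, observed_outages):.5f}")  # ~0.99896
```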

See also

Byzantine fault
Component-based software engineering
Dependability
Design by contract
Fault injection
Fault tolerance
Microrebooting
Reliability engineering
Reliability, availability and serviceability
Software testing
Test-driven development
Verification and validation

References

  1. Brown, Aaron (June 2001). "Addressing Human Error with Undo" (PDF). ROC Retreat. Retrieved 24 February 2020.