Failure transparency

In a distributed system, failure transparency refers to the extent to which errors and subsequent recoveries of hosts and services within the system are invisible to users and applications. For example, if a server fails, but users are automatically redirected to another server and never notice the failure, the system is said to exhibit high failure transparency.

Failure transparency is one of the most difficult types of transparency to achieve, since it is often hard to tell whether a server has actually failed or is merely responding very slowly.[1] Additionally, full failure transparency is generally impossible to achieve in a distributed system, since networks are unreliable.

There is also usually a trade-off between achieving a high level of failure transparency and maintaining adequate system performance. For example, if a distributed system attempts to mask a transient server failure by having the client contact the failed server multiple times, performance may suffer. In such a case, it would have been preferable to give up earlier and try another server.[1]
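As an illustration of this trade-off, the sketch below bounds the number of retries per server before failing over to a backup. The server list, the fetch() helper, and the retry and backoff parameters are hypothetical placeholders, not part of any particular system.

```python
import time

SERVERS = ["10.0.0.1", "10.0.0.2"]  # hypothetical: primary first, then backup

def fetch(server: str, request: str) -> str:
    """Placeholder for a real network call; raises on failure."""
    raise ConnectionError(f"{server} unreachable")

def transparent_fetch(request: str, retries: int = 2, backoff: float = 0.5) -> str:
    """Mask transient failures: retry each server a few times, then fail over.

    The retry bound embodies the trade-off above: more retries mask more
    transient failures, but slow every request sent to a failed server.
    """
    last_error = None
    for server in SERVERS:
        for attempt in range(retries):
            try:
                return fetch(server, request)
            except ConnectionError as err:
                last_error = err
                time.sleep(backoff * (attempt + 1))  # brief backoff before retrying
    raise last_error  # transparency has limits: eventually the failure surfaces
```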

Related Research Articles

Load balancing (computing)

In computing, load balancing refers to the process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient. Load balancing can optimize the response time and avoid unevenly overloading some compute nodes while other compute nodes are left idle.
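A minimal sketch of one common strategy, round-robin assignment, which hands tasks to workers in rotation; the worker names here are placeholders.

```python
import itertools

class RoundRobinBalancer:
    """Distribute tasks over workers in rotation, one common strategy."""
    def __init__(self, workers):
        self._cycle = itertools.cycle(workers)

    def pick(self):
        return next(self._cycle)

balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
assignments = [balancer.pick() for _ in range(6)]
print(assignments)  # ['node-a', 'node-b', 'node-c', 'node-a', 'node-b', 'node-c']
```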

Any change in a computing system, such as a new feature or new component, is transparent if the system after the change adheres to the previous external interface as much as possible while changing its internal behaviour. The purpose is to shield from change all systems on the other end of the interface. Confusingly, the term refers to the overall invisibility of the component; it does not refer to visibility of the component's internals. The term transparent is widely used in computing marketing as a substitute for invisible, since invisible carries a negative connotation while transparent carries a positive one. This usage is misleading: the "transparent" process is one whose workings are hidden, which in ordinary language would be called opaque, especially with regard to data structures. Because of this counter-intuitive definition, modern computing literature tends to prefer "agnostic" over "transparent".

Checkpointing is a technique that provides fault tolerance for computing systems. It consists of saving a snapshot of the application's state so that the application can restart from that point in case of failure. This is particularly important for long-running applications executed on failure-prone computing systems.
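A minimal checkpointing sketch, assuming a single-process application whose state can be serialized with Python's pickle module; the file name, step counts, and the "work" performed are illustrative only.

```python
import os
import pickle

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file name

def run(total_steps: int = 1_000_000) -> int:
    # Restart from the last snapshot if one exists, else start from scratch.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            step, accumulator = pickle.load(f)
    else:
        step, accumulator = 0, 0

    while step < total_steps:
        accumulator += step  # stand-in for real work
        step += 1
        if step % 100_000 == 0:  # periodically snapshot the state
            with open(CHECKPOINT, "wb") as f:
                pickle.dump((step, accumulator), f)
    return accumulator
```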

Self-stabilization is a concept of fault-tolerance in distributed systems. Given any initial state, a self-stabilizing distributed system will end up in a correct state in a finite number of execution steps.
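One classic construction is Dijkstra's K-state token ring, sketched below: started from an arbitrary, possibly corrupted state, the ring converges to a legitimate state in which exactly one machine holds the "privilege" (token) at a time. The ring size and the random scheduler here are illustrative choices.

```python
import random

def step(x, K):
    """One step of Dijkstra's K-state token ring (machine 0 is distinguished).

    A machine is 'privileged' (holds the token) when its guard is true;
    the scheduler fires one privileged machine per step.
    """
    privileged = [0] if x[0] == x[-1] else []
    privileged += [i for i in range(1, len(x)) if x[i] != x[i - 1]]
    i = random.choice(privileged)          # at least one machine is always privileged
    x[i] = (x[0] + 1) % K if i == 0 else x[i - 1]
    return len(privileged)                 # privilege count before this step fired

# Start from an arbitrary (possibly corrupted) state; K >= n guarantees
# convergence to exactly one privilege circulating around the ring.
n, K = 5, 5
x = [random.randrange(K) for _ in range(n)]
while step(x, K) != 1:
    pass
print("stabilized:", x)
```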

In systems engineering, dependability is a measure of a system's availability, reliability, and maintainability, and in some cases other characteristics such as durability, safety, and security. In real-time computing, dependability is the ability to provide services that can be trusted within a time period. The service guarantees must hold even when the system is subject to attacks or natural failures.

A Byzantine fault is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine Generals Problem", developed to describe a situation in which, in order to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable.

Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel. The redundancy (duplication) allows error detection and error correction: the output from lockstep operations can be compared to determine if there has been a fault if there are at least two systems, and the error can be automatically corrected if there are at least three systems, via majority vote. The term "lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical.
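A toy sketch of the voting step: the same input is run through every replica and the outputs are compared. The replicas here are plain Python functions, one deliberately faulty, purely for illustration.

```python
from collections import Counter

def lockstep(replicas, x):
    """Run the same operation on all replicas and vote on the result.

    Two replicas suffice to *detect* a disagreement; three allow the
    majority value to *correct* a single faulty replica.
    """
    outputs = [f(x) for f in replicas]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes > len(outputs) // 2:
        return value                      # majority outvotes a faulty replica
    raise RuntimeError("no majority: fault detected but not correctable")

def healthy(x):
    return x * x

def faulty(x):
    return x * x + 1                      # simulated fault in one replica

print(lockstep([healthy, healthy, faulty], 7))  # 49: majority corrects the fault
```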

Fault tolerance is the property that enables a system to continue operating properly in the presence of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, whereas in a naively designed system even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

High availability (HA) is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a longer than normal period.
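Uptime targets are often quoted in "nines"; the snippet below does the corresponding arithmetic, converting an availability fraction into permitted downtime per year.

```python
# Quick arithmetic for "nines" of availability: allowed downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (0.99, 0.999, 0.9999):
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.2%} uptime -> {downtime_h:.2f} h downtime/year")
# 99.00% uptime -> 87.60 h downtime/year
# 99.90% uptime -> 8.76 h downtime/year
# 99.99% uptime -> 0.88 h downtime/year
```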

In computer science, state machine replication or state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. The approach also provides a framework for understanding and designing replication management protocols.
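A minimal sketch of the idea, assuming a deterministic key-value state machine: if every replica applies the same ordered log of commands, all replicas end in the same state. In a real system a consensus protocol would order the log; here it is simply given.

```python
class KVStateMachine:
    """A deterministic state machine: same command log -> same final state."""
    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "del":
            self.state.pop(key, None)

# Replication: deliver the *same ordered log* of commands to every replica.
log = [("set", "x", 1), ("set", "y", 2), ("del", "x", None)]
replicas = [KVStateMachine() for _ in range(3)]
for command in log:           # in practice, a consensus protocol orders the log
    for replica in replicas:
        replica.apply(command)

assert all(r.state == {"y": 2} for r in replicas)  # replicas stay consistent
```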

Paxos is a family of protocols for solving consensus in a network of unreliable or fallible processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communications may experience failures.
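A single-decree Paxos sketch, with in-process acceptors and no network or failure simulation; the quorum arithmetic follows the protocol's two phases (prepare/promise, then accept). This is a sketch of the core logic, not a production implementation.

```python
class Acceptor:
    """Paxos acceptor: remembers the highest ballot it has promised and the
    highest-ballot (ballot, value) pair it has accepted, if any."""
    def __init__(self):
        self.promised = 0
        self.accepted = None

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted   # promise, reporting any prior accept
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    # Phase 1: collect promises from a majority quorum.
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [prior for ok, prior in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # A proposer must adopt the value of the highest-ballot prior accept.
    prior = max((p for p in granted if p is not None), default=None)
    if prior is not None:
        value = prior[1]
    # Phase 2: ask the acceptors to accept; a majority of accepts decides.
    votes = sum(a.accept(ballot, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="A"))  # 'A' is chosen
print(propose(acceptors, ballot=2, value="B"))  # still 'A': consensus is stable
```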

Service-oriented architectures (SOA) are based on the notion of software services, which are high-level software components that include web services. Implementation of an SOA requires tools as well as run-time infrastructure software, collectively referred to as a service-oriented architecture implementation framework (SOAIF). The SOAIF envisions a comprehensive framework that provides all the technology an enterprise might need to build and run an SOA, including both design-time and run-time capabilities.

A clustered file system is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system. Clustered file systems can provide features like location-independent addressing and redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance.
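A toy illustration of striping, assuming fixed-size blocks placed round-robin across storage nodes; real parallel file systems add metadata, redundancy, and recovery on top of this placement.

```python
def stripe(data: bytes, nodes: int, block: int = 4):
    """Split data into fixed-size blocks and spread them round-robin
    across storage nodes, as a parallel file system might."""
    placement = [[] for _ in range(nodes)]
    for i in range(0, len(data), block):
        placement[(i // block) % nodes].append(data[i:i + block])
    return placement

print(stripe(b"abcdefghijkl", nodes=3))
# [[b'abcd'], [b'efgh'], [b'ijkl']]  -- blocks can be read in parallel
```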

In quantum computing, the quantum threshold theorem states that a quantum computer with a physical error rate below a certain threshold can, through the application of quantum error correction schemes, suppress the logical error rate to arbitrarily low levels. This shows that quantum computers can be made fault-tolerant, as an analogue to von Neumann's threshold theorem for classical computation. The result was proven independently by the groups of Dorit Aharonov and Michael Ben-Or; Emanuel Knill, Raymond Laflamme, and Wojciech Zurek; and Alexei Kitaev. These results built on a paper by Peter Shor, which proved a weaker version of the threshold theorem.
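A common textbook form of the resulting scaling, for concatenated distance-3 codes (an assumption for this sketch; the theorem itself is more general): with physical error rate p below the threshold p_th, concatenation suppresses the logical error rate doubly exponentially.

```latex
% Heuristic scaling for concatenated distance-3 codes:
% k levels of concatenation with p < p_th give a logical error rate
\[
  p_k \le p_{\mathrm{th}} \left(\frac{p}{p_{\mathrm{th}}}\right)^{2^{k}},
\]
% which falls doubly exponentially in k, while the qubit overhead
% grows only polynomially with the target accuracy.
```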

Computer cluster

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

Single point of failure

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.

A distributed operating system is system software over a collection of independent, networked, communicating, and physically separate computational nodes. They handle jobs which are serviced by multiple CPUs. Each individual node holds a specific software subset of the global aggregate operating system. Each subset is a composite of two distinct service provisioners. The first is a ubiquitous minimal kernel, or microkernel, that directly controls that node's hardware. Second is a higher-level collection of system management components that coordinate the node's individual and collaborative activities. These components abstract microkernel functions and support user applications.

Quasi-opportunistic supercomputing

Quasi-opportunistic supercomputing is a computational paradigm for supercomputing on a large number of geographically dispersed computers. Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic resource sharing.

References

  1. Tanenbaum, Andrew S., and Maarten van Steen. Distributed Systems: Principles and Paradigms. 2nd ed., Prentice Hall, 2007. ISBN 0-13-239227-5.
