Dependability

Last updated November 30, 2024

In systems engineering, dependability is a measure of a system's availability, reliability, maintainability, and in some cases, other characteristics such as durability, safety and security.^[1] In real-time computing, dependability is the ability to provide services that can be trusted within a time-period.^[2] The service guarantees must hold even when the system is subject to attacks or natural failures.

The International Electrotechnical Commission (IEC), via its Technical Committee TC 56, develops and maintains international standards that provide systematic methods and tools for dependability assessment and management of equipment, services, and systems throughout their life cycles. The IFIP Working Group 10.4^[3] on "Dependable Computing and Fault Tolerance" plays a role in synthesizing the technical community's progress in the field and organizes two workshops each year to disseminate the results.

Dependability can be broken down into three elements:

Attributes - a way to assess the dependability of a system
Threats - an understanding of the things that can affect the dependability of a system
Means - ways to increase a system's dependability

History

Some sources hold that the word was coined in the 1910s in Dodge Brothers automobile print advertising. However, the word predates that period: the Oxford English Dictionary records its first use in 1901.

As interest in fault tolerance and system reliability increased in the 1960s and 1970s, dependability came to be a measure of [x] as measures of reliability came to encompass additional measures like safety and integrity.^[4] In the early 1980s, Jean-Claude Laprie thus chose dependability as the term to encompass studies of fault tolerance and system reliability without the extension of meaning inherent in reliability.^[5]

The field of dependability has evolved from these beginnings to be an internationally active field of research fostered by a number of prominent international conferences, notably the International Conference on Dependable Systems and Networks, the International Symposium on Reliable Distributed Systems and the International Symposium on Software Reliability Engineering.

Traditionally, dependability for a system incorporates availability, reliability, and maintainability, but since the 1980s safety and security have been added to the measures of dependability.^[6]

Elements of dependability

Attributes

Attributes are qualities of a system. These can be assessed using qualitative or quantitative measures to determine its overall dependability. Avizienis et al. define the following Dependability Attributes:

Availability - readiness for correct service
Reliability - continuity of correct service
Safety - absence of catastrophic consequences on the user(s) and the environment
Integrity - absence of improper system alteration
Maintainability - ability for easy maintenance (repair)

As these definitions suggest, only Availability and Reliability are quantifiable by direct measurement, whilst the others are more subjective. For instance, Safety cannot be measured directly via metrics but requires a subjective assessment in which judgment is applied to give a level of confidence, whereas Reliability can be measured as failures over time.
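As a rough illustration of the directly measurable attributes, the sketch below estimates a failure rate, a mean time between failures, and an availability figure from a log of outages. The `Outage` record, the function names, and the sample numbers are assumptions made for this example; they are not taken from Avizienis et al.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """One observed failure: when service stopped and when it was restored (hours)."""
    start: float
    end: float


def downtime(outages):
    """Total hours during which service was not delivered."""
    return sum(o.end - o.start for o in outages)


def failure_rate(outages, observed_hours):
    """Failures per hour of operating time (assumes some uptime in the window)."""
    return len(outages) / (observed_hours - downtime(outages))


def mtbf(outages, observed_hours):
    """Mean operating time between failures; infinite if none were observed."""
    if not outages:
        return math.inf
    return (observed_hours - downtime(outages)) / len(outages)


def availability(outages, observed_hours):
    """Fraction of the observation window during which service was up."""
    return 1.0 - downtime(outages) / observed_hours


# Hypothetical log: three outages within a 1000-hour observation window.
log = [Outage(100.0, 102.0), Outage(400.0, 401.0), Outage(900.0, 905.0)]
window = 1000.0

print(f"failure rate: {failure_rate(log, window):.5f} per hour")
print(f"MTBF:         {mtbf(log, window):.1f} h")
print(f"availability: {availability(log, window):.3f}")
```

Nothing comparable can be computed for Safety from such a log alone, which is the distinction the paragraph above draws.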

Confidentiality, i.e. the absence of unauthorized disclosure of information, is also used when addressing security. Security is a composite of Confidentiality, Integrity, and Availability. Security is sometimes classed as an attribute,^[7] but the current view is to aggregate it with dependability and treat the result as a composite term called Dependability and Security.^[2]

Practically, applying security measures to the appliances of a system generally improves the dependability by limiting the number of externally originated errors.

Threats

Threats are things that can affect a system and cause a drop in Dependability. There are three main terms that must be clearly understood:

Fault: A fault (usually referred to as a bug for historical reasons) is a defect in a system. The presence of a fault in a system may or may not lead to a failure. For instance, although a system may contain a fault, its input and state conditions may never cause the fault to be executed so that an error occurs; in that case the fault never manifests as a failure.
Error: An error is a discrepancy between the intended behavior of a system and its actual behavior inside the system boundary. Errors occur at runtime when some part of the system enters an unexpected state due to the activation of a fault. Since errors are generated from invalid states, they are hard to observe without special mechanisms such as debuggers or debug output to logs.
Failure: A failure is an instance in time when a system displays behavior that is contrary to its specification. An error does not necessarily cause a failure. For instance, an exception may be thrown by a system, but it may be caught and handled using fault tolerance techniques so that the overall operation of the system still conforms to the specification.

It is important to note that Failures are recorded at the system boundary. They are basically Errors that have propagated to the system boundary and have become observable. Faults, Errors and Failures operate according to a mechanism. This mechanism is sometimes known as a Fault-Error-Failure chain.^[8] As a general rule a fault, when activated, can lead to an error (which is an invalid state) and the invalid state generated by an error may lead to another error or a failure (which is an observable deviation from the specified behavior at the system boundary).^[9]

Once a fault is activated, an error is created. An error may act in the same way as a fault in that it can create further error conditions; an error may therefore propagate multiple times within a system boundary without causing an observable failure. If an error propagates outside the system boundary, a failure is said to occur. A failure is basically the point at which it can be said that a service is failing to meet its specification. Since the output data from one service may be fed into another, a failure in one service may propagate into another service as a fault, so a chain can be formed of the form: Fault leading to Error leading to Failure leading to Error, etc.
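The chain can be made concrete with a small sketch. The function names and the assumed specification (an empty input should be reported as "n/a") are hypothetical and chosen only for illustration. The fault is a missing empty-input case; it stays dormant until an empty list activates it, and the resulting error becomes a failure only if it crosses the boundary to the caller.

```python
def mean(values):
    # Fault: the empty-list case was never considered. The defect is dormant
    # for every non-empty input.
    return sum(values) / len(values)


def report_unprotected(values):
    # No handling: once the fault is activated, the error propagates to the
    # caller and is observed there as a failure.
    return f"mean={mean(values):.2f}"


def report_tolerant(values):
    # The error is detected and handled inside the boundary, so the caller
    # sees behavior that still conforms to the (assumed) specification.
    try:
        return f"mean={mean(values):.2f}"
    except ZeroDivisionError:
        return "mean=n/a"


print(report_unprotected([1, 2, 3]))  # fault present but not activated
print(report_tolerant([]))            # fault activated, error contained

try:
    report_unprotected([])            # fault activated, error escapes
except ZeroDivisionError as exc:
    print("failure observed at the boundary:", exc)
```

The same defect is present in all three calls; only the input and the presence of handling decide whether it is ever seen from outside.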

Means

Because the mechanism of the Fault-Error-Failure chain is understood, it is possible to construct means to break these chains and thereby increase the dependability of a system. Four means have been identified so far:

Prevention
Removal
Forecasting
Tolerance

Fault Prevention deals with preventing faults from being introduced into a system. This can be accomplished through the use of development methodologies and good implementation techniques.

Fault Removal can be divided into two sub-categories: removal during development and removal during use. Removal during development requires verification so that faults can be detected and removed before a system is put into production. Once a system is in production, a mechanism is needed to record failures and remove the underlying faults via a maintenance cycle.

Fault Forecasting predicts likely faults so that they can be removed or their effects can be circumvented.^[10]^[11]

Fault Tolerance deals with putting mechanisms in place that will allow a system to still deliver the required service in the presence of faults, although that service may be at a degraded level.
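One widely used fault tolerance mechanism is redundancy with majority voting. The sketch below runs three replicas of a computation and returns the value that a strict majority agrees on, so that a single faulty replica is masked. The technique is chosen here as an example and is not described in the text above; the replica functions and names are hypothetical, and the outputs are assumed to be hashable and directly comparable.

```python
from collections import Counter


def majority_vote(outputs, total_replicas):
    """Return the value backed by a strict majority of all replicas, or raise."""
    if not outputs:
        raise RuntimeError("no replica produced a result")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= total_replicas:
        raise RuntimeError("no majority: the fault cannot be masked")
    return value


def run_replicated(replicas, x):
    """Run every replica on the same input and vote on the collected outputs."""
    outputs = []
    for replica in replicas:
        try:
            outputs.append(replica(x))
        except Exception:
            # A crashed replica casts no vote but still counts toward the total.
            continue
    return majority_vote(outputs, len(replicas))


def square_a(x):
    return x * x


def square_b(x):
    return x ** 2


def square_faulty(x):
    return x * x + 1  # deliberately wrong, to simulate a faulty replica


print(run_replicated([square_a, square_b, square_faulty], 7))
```

The vote is taken against the total number of replicas, not the number of answers received, so one surviving but wrong replica cannot win by default when the others have crashed.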

Dependability means are intended to reduce the number of failures made visible to the end users of a system.

Persistence

Based on how faults appear or persist, they are classified as:

Transient: They appear without apparent cause and disappear again without apparent cause
Intermittent: They appear multiple times, possibly without a discernible pattern, and disappear on their own
Permanent: Once they appear, they do not get resolved on their own
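In practice, persistence is often probed by repeating an operation. The sketch below is a simplified heuristic under stated assumptions: the labels, the `classify_by_retry` function, and the `FailsFirstNCalls` test double are all hypothetical, and a finite number of retries can only suggest, never prove, that a fault is permanent.

```python
def classify_by_retry(operation, attempts=3):
    """Call `operation` `attempts` times and label the observed behavior."""
    failures = 0
    for _ in range(attempts):
        try:
            operation()
        except Exception:
            failures += 1
    if failures == 0:
        return "no fault observed"
    if failures == attempts:
        return "possibly permanent"
    return "transient or intermittent"


class FailsFirstNCalls:
    """Test double that raises on its first `n` calls, then succeeds."""

    def __init__(self, n):
        self.remaining = n

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise IOError("simulated fault")
        return "ok"


print(classify_by_retry(FailsFirstNCalls(0)))
print(classify_by_retry(FailsFirstNCalls(1)))
print(classify_by_retry(FailsFirstNCalls(10)))
```

Telling a transient fault from an intermittent one would need observation over a longer period, which this sketch does not attempt.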

Dependability of information systems and survivability

Some works on dependability^[12] use structured information systems, e.g. with SOA, to introduce the attribute survivability, thus taking into account the degraded services that an information system sustains or resumes after a non-maskable failure.

The flexibility of current frameworks encourages system architects to enable reconfiguration mechanisms that refocus the available, safe resources on supporting the most critical services, rather than over-provisioning to build a failure-proof system.
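A reconfiguration policy of this kind might look like the following sketch, which hands the capacity that survives a failure to services in order of criticality and sheds the rest. The greedy policy, the service names, and the numbers are assumptions for illustration only; real reconfiguration mechanisms are considerably more involved.

```python
def reconfigure(services, capacity):
    """Allocate surviving capacity to services, most critical first.

    `services` is a list of (name, criticality, demand) tuples, where a higher
    criticality means a more important service. Returns (kept, shed) name lists.
    """
    kept, shed = [], []
    ranked = sorted(services, key=lambda s: s[1], reverse=True)
    for name, _criticality, demand in ranked:
        if demand <= capacity:
            kept.append(name)
            capacity -= demand
        else:
            shed.append(name)
    return kept, shed


# Hypothetical system that needs 140 units in total but has 80 left after a failure.
services = [
    ("billing", 3, 40),
    ("emergency-dispatch", 10, 50),
    ("reporting", 1, 30),
    ("auth", 8, 20),
]
kept, shed = reconfigure(services, capacity=80)
print("kept:", kept)
print("shed:", shed)
```

The system keeps delivering its most critical services at the cost of the others, which is the degraded but surviving mode of operation that survivability describes.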

With the generalisation of networked information systems, accessibility was introduced to give greater importance to users' experience.

To take into account the level of performance, the measurement of performability is defined as "quantifying how well the object system performs in the presence of faults over a specified period of time".^[13]

References

[1] IEC, Electropedia del 192 Dependability, http://www.electropedia.org, select 192 Dependability, see 192-01-22 Dependability.
[2] A. Avizienis, J.-C. Laprie, Brian Randell, and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, pp. 11-33, 2004.
[3] "Dependable Systems and Networks". www.dependability.org. Retrieved 2021-06-08.
[4] Brian Randell, "Software Dependability: A Personal View", in Proc. of the 25th International Symposium on Fault-Tolerant Computing (FTCS-25), California, USA, pp. 35-41, June 1995.
[5] J.-C. Laprie, "Dependable Computing and Fault Tolerance: Concepts and Terminology," in Proc. 15th IEEE Int. Symp. on Fault-Tolerant Computing, 1985.
[6] A. Avizienis, J.-C. Laprie, and Brian Randell, "Fundamental Concepts of Dependability," Research Report No 1145, LAAS-CNRS, April 2001.
[7] I. Sommerville, Software Engineering, Addison-Wesley, 2004.
[8] A. Avizienis, V. Magnus U, J. C. Laprie, and Brian Randell, "Fundamental Concepts of Dependability," presented at ISW-2000, Cambridge, MA, 2000.
[9] Moradi, Mehrdad; Van Acker, Bert; Vanherpen, Ken; Denil, Joachim (2019). Chamberlain, Roger; Taha, Walid; Törngren, Martin (eds.). "Model-Implemented Hybrid Fault Injection for Simulink (Tool Demonstrations)". Cyber Physical Systems. Model-Based Design. Lecture Notes in Computer Science. 11615. Cham: Springer International Publishing: 71–90. doi:10.1007/978-3-030-23703-5_4. ISBN 978-3-030-23703-5. S2CID 195769468.
[10] "Optimizing fault injection in FMI co-simulation through sensitivity partitioning | Proceedings of the 2019 Summer Simulation Conference". dl.acm.org. Retrieved 2020-06-15.
[11] Moradi, Mehrdad, Bentley James Oakes, Mustafa Saraoglu, Andrey Morozov, Klaus Janschek, and Joachim Denil. "Exploring Fault Parameter Space Using Reinforcement Learning-based Fault Injection." (2020).
[12] John C. Knight, Elisabeth A. Strunk, Kevin J. Sullivan, "Towards a Rigorous Definition of Information System Survivability". Archived 2006-10-29 at the Wayback Machine.
[13] John F. Meyer, William H. Sanders, "Specification and construction of performability models".
[14] "DSN 2022". dsn2022.github.io. Retrieved 2021-08-01.
[15] "SRDS-2021". srds-conference.org. Retrieved 2021-08-01.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

