In systems engineering, dependability is a measure of a system's availability, reliability, maintainability, and maintenance support performance and, in some cases, other characteristics such as durability, safety and security. In software engineering, dependability is the ability to provide services that can defensibly be trusted within a time-period. This may also encompass mechanisms designed to increase and maintain the dependability of a system or software.
Systems engineering is an interdisciplinary field of engineering and engineering management that focuses on how to design and manage complex systems over their life cycles. At its core, systems engineering utilizes systems thinking principles to organize this body of knowledge. The individual outcome of such efforts, an engineered system, can be defined as a combination of components that work in synergy to collectively perform a useful function.
Software engineering is the systematic application of engineering approaches to the development of software.
The International Electrotechnical Commission (IEC), via its Technical Committee TC 56, develops and maintains international standards that provide systematic methods and tools for dependability assessment and management of equipment, services, and systems throughout their life cycles.
The International Electrotechnical Commission is an international standards organization that prepares and publishes International Standards for all electrical, electronic and related technologies – collectively known as "electrotechnology". IEC standards cover a vast range of technologies from power generation, transmission and distribution to home appliances and office equipment, semiconductors, fibre optics, batteries, solar energy, nanotechnology and marine energy as well as many others. The IEC also manages three global conformity assessment systems that certify whether equipment, system or components conform to its International Standards.
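Dependability can be broken down into three elements: attributes (qualities by which the dependability of a system is assessed), threats (things that can affect a system and reduce its dependability), and means (ways to break the chain from fault to failure and thereby increase dependability).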
Some sources hold that the word was coined in the 1910s in Dodge Brothers automobile print advertising. However, the word predates that period: the Oxford English Dictionary dates its first use to 1901.
The Oxford English Dictionary (OED) is the principal historical dictionary of the English language, published by Oxford University Press. It traces the historical development of the English language, providing a comprehensive resource to scholars and academic researchers, as well as describing usage in its many variations throughout the world. The second edition, comprising 21,728 pages in 20 volumes, was published in 1989.
As interest in fault tolerance and system reliability increased in the 1960s and 1970s, dependability came to be a measure of [x] as measures of reliability came to encompass additional measures like safety and integrity. In the early 1980s, Jean-Claude Laprie therefore chose dependability as the term to encompass studies of fault tolerance and system reliability without the extension of meaning inherent in reliability.
Reliability engineering is a sub-discipline of systems engineering that emphasizes dependability in the lifecycle management of a product. Dependability, or reliability, describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.
The field of dependability has evolved from these beginnings to be an internationally active field of research fostered by a number of prominent international conferences, notably the International Conference on Dependable Systems and Networks, the International Symposium on Reliable Distributed Systems and the International Symposium on Software Reliability Engineering.
The International Conference on Dependable Systems and Networks is an annual conference on topics related to dependable computer systems and reliable networks. It typically features a number of coordinated tracks, including the Dependable Computing and Communications Symposium (DCCS), the Performance and Dependability Symposium (PDS), several workshops, tutorials, a student forum, and fast abstracts. It is sponsored by the IEEE and the IFIP WG 10.4 on Dependable Computing and Fault Tolerance. DSN was formed in 2000 by merging the IEEE International Symposium on Fault-Tolerant Computing (FTCS) and the IFIP International Working Conference on Dependable Computing for Critical Applications (DCCA). The instance number for DSN is taken from FTCS which was up to its 29th instance in its last year of 1999.
The International Symposium on Software Reliability Engineering is an academic conference with strong industry participation that has run since 1990 and covers reliability engineering for software. The first meeting was held in Washington, DC. In addition to cities in the USA, it has also been held in Paderborn, Germany; Hong Kong; Saint Malo, Bretagne, France; Trollhattan, Sweden; Mysuru, India; Hiroshima, Japan; Naples, Italy; Ottawa, Canada; and Toulouse, France. It is concerned with properties such as the reliability, availability, safety, security and quality of software. It is sponsored by the IEEE Computer Society. The symposium usually lasts four days and has integrated workshops and tutorials in a multi-track program.
Traditionally, dependability for a system incorporates availability, reliability and maintainability, but since the 1980s safety and security have been added to the measures of dependability.
In reliability theory and reliability engineering, the term availability has the following meanings:
In engineering, maintainability is the ease with which a product can be maintained in order to:
Safety is the state of being "safe", the condition of being protected from harm or other non-desirable outcomes. Safety can also refer to the control of recognized hazards in order to achieve an acceptable level of risk.
Attributes are qualities of a system. These can be assessed to determine its overall dependability using Qualitative or Quantitative measures. Avizienis et al. define the following Dependability Attributes:
As these definitions suggest, only Availability and Reliability are quantifiable by direct measurement, whilst the others are more subjective. For instance, Safety cannot be measured directly via metrics but is a subjective assessment that requires judgmental information to be applied to give a level of confidence, whilst Reliability can be measured as failures over time.
Confidentiality, i.e. the absence of unauthorized disclosure of information, is also used when addressing security. Security is a composite of Confidentiality, Integrity, and Availability. Security is sometimes classed as an attribute, but the current view is to aggregate it together with dependability and treat Dependability as a composite term called Dependability and Security.
Practically, applying security measures to the appliances of a system generally improves the dependability by limiting the number of externally originated errors.
Threats are things that can affect a system and cause a drop in Dependability. There are three main terms that must be clearly understood: Fault, Error and Failure.
It is important to note that Failures are recorded at the system boundary. They are basically Errors that have propagated to the system boundary and have become observable. Faults, Errors and Failures operate according to a mechanism. This mechanism is sometimes known as a Fault-Error-Failure chain. As a general rule, a fault, when activated, can lead to an error (which is an invalid state), and the invalid state generated by an error may lead to another error or a failure (which is an observable deviation from the specified behaviour at the system boundary).
Once a fault is activated, an error is created. An error may act in the same way as a fault in that it can create further error conditions; an error may therefore propagate multiple times within a system boundary without causing an observable failure. If an error propagates outside the system boundary, a failure is said to occur. A failure is basically the point at which it can be said that a service is failing to meet its specification. Since the output data from one service may be fed into another, a failure in one service may propagate into another service as a fault, so a chain can be formed of the form: Fault leading to Error leading to Failure leading to Error, etc.
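The chain described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the two services, their names (`average_service`, `alarm_service`), the hard-coded divisor and the threshold are hypothetical and do not come from any cited source.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    value: float   # what the service delivers at its boundary
    failed: bool   # True when the delivered value deviates from the specification


def average_service(samples):
    """Hypothetical service specified to return the arithmetic mean.

    Fault: the divisor is hard-coded to 3 instead of len(samples).
    The fault stays dormant whenever exactly three samples are supplied.
    """
    total = sum(samples)
    mean = total / 3                  # activated fault -> error (invalid internal state)
    expected = total / len(samples)   # what the specification requires
    # The error becomes a failure when it is observable at the service boundary.
    return Outcome(mean, failed=(mean != expected))


def alarm_service(mean, threshold=2.5):
    """Hypothetical downstream service: alarm when the mean exceeds the threshold."""
    return mean > threshold


dormant = average_service([2, 4, 6])   # fault present but not activated
activated = average_service([2, 4])    # fault activated: delivers 2.0 instead of 3.0

# The failure of average_service enters alarm_service as an external fault,
# so the alarm that the true mean (3.0) should trigger is not raised.
missed_alarm = alarm_service(3.0) and not alarm_service(activated.value)
```

With three samples the fault is present but never activated. With two samples it produces an invalid value that crosses the boundary of the first service (a failure) and becomes the cause of a wrong decision in the second.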
Since the mechanism of a Fault-Error-Failure chain is understood, it is possible to construct means to break these chains and thereby increase the dependability of a system. Four means have been identified so far: fault prevention, fault removal, fault forecasting and fault tolerance.
Fault Prevention deals with preventing faults being incorporated into a system. This can be accomplished by use of development methodologies and good implementation techniques.
Fault Removal can be sub-divided into two sub-categories: Removal During Development and Removal During Use.
Removal during development requires verification so that faults can be detected and removed before a system is put into production. Once a system has been put into production, a mechanism is needed to record failures and remove the underlying faults via a maintenance cycle.
Fault Forecasting predicts likely faults so that they can be removed or their effects can be circumvented.
Fault Tolerance deals with putting mechanisms in place that will allow a system to still deliver the required service in the presence of faults, although that service may be at a degraded level.
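A minimal Python sketch of this idea follows, assuming a primary operation that may raise and a fallback that delivers a degraded but still useful result. The function names and the cached value are hypothetical and serve only as illustration.

```python
def call_with_fallback(primary, fallback):
    """Return (result, degraded); use the fallback if the primary raises."""
    try:
        return primary(), False
    except Exception:
        # The fault is tolerated: service continues at a degraded level.
        return fallback(), True


def live_price():
    """Hypothetical primary service that is currently unavailable."""
    raise ConnectionError("price feed unreachable")


def cached_price():
    """Hypothetical fallback: a stale but still usable value."""
    return 101.5


result, degraded = call_with_fallback(live_price, cached_price)
# result is 101.5 and degraded is True
```

Returning the `degraded` flag alongside the result lets the caller know that the service was delivered at a reduced level, rather than hiding the fault entirely.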
Dependability means are intended to reduce the number of failures presented to the user of a system. Failures are traditionally recorded over time and it is useful to understand how their frequency is measured so that the effectiveness of means can be assessed.
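As an illustration of how failure frequency is commonly quantified, the sketch below computes a failure rate, the mean time between failures (MTBF) and steady-state availability. These are the widely used textbook definitions rather than formulas taken from this article, and the figures in the example are invented.

```python
def failure_rate(failures, operating_hours):
    """Failures per operating hour."""
    return failures / operating_hours


def mtbf(failures, operating_hours):
    """Mean time between failures, in hours."""
    return operating_hours / failures


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from MTBF and mean time to repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Invented example: 4 failures observed over 2000 operating hours,
# each taking 2 hours to repair on average.
rate = failure_rate(4, 2000)
between_failures = mtbf(4, 2000)               # 500.0 hours
uptime_fraction = availability(between_failures, 2)
```

Comparing these figures before and after a change (for example, a new fault removal process) gives one way to assess whether a dependability means has been effective.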
Recent works on dependability take advantage of structured information systems, e.g. with SOA, to introduce a more efficient attribute, survivability, thus taking into account the degraded services that an information system sustains or resumes after a non-maskable failure.
The flexibility of current frameworks encourages system architects to enable reconfiguration mechanisms that refocus the available, safe resources on supporting the most critical services, rather than over-provisioning to build a failure-proof system.
With the generalisation of networked information systems, accessibility was introduced to give greater importance to users' experience.
To take into account the level of performance, the measurement of performability is defined as "quantifying how well the object system performs in the presence of faults over a specified period of time".
Safety engineering is an engineering discipline which assures that engineered systems provide acceptable levels of safety. It is strongly related to industrial engineering/systems engineering, and the subset system safety engineering. Safety engineering assures that a life-critical system behaves as needed, even when components fail.
Fault tree analysis (FTA) is a top-down, deductive failure analysis in which an undesired state of a system is analyzed using Boolean logic to combine a series of lower-level events. This analysis method is mainly used in the fields of safety engineering and reliability engineering to understand how systems can fail, to identify the best ways to reduce risk or to determine event rates of a safety accident or a particular system level (functional) failure. FTA is used in the aerospace, nuclear power, chemical and process, pharmaceutical, petrochemical and other high-hazard industries; but is also used in fields as diverse as risk factor identification relating to social service system failure. FTA is also used in software engineering for debugging purposes and is closely related to cause-elimination technique used to detect bugs.
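The way Boolean gates combine lower-level events can be sketched numerically. The example below assumes that basic events are statistically independent, which is a common simplification and not something stated in the text above. The tree itself (a power loss, or both of two pumps failing) and its probabilities are hypothetical.

```python
from math import prod


def and_gate(*probabilities):
    """Output event occurs only if all independent input events occur."""
    return prod(probabilities)


def or_gate(*probabilities):
    """Output event occurs if at least one independent input event occurs."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical tree: cooling is lost if power is lost OR both pumps fail.
p_power_loss = 0.01
p_pump_failure = 0.1

p_both_pumps = and_gate(p_pump_failure, p_pump_failure)
p_cooling_lost = or_gate(p_power_loss, p_both_pumps)
print(f"P(cooling lost) = {p_cooling_lost:.4f}")
```

Under these assumptions the redundant pumps contribute about as much to the top event as the single power supply, which is the kind of insight a fault tree is meant to surface.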
Common and special causes are the two distinct origins of variation in a process, as defined in the statistical thinking and methods of Walter A. Shewhart and W. Edwards Deming. Briefly, "common causes", also called natural patterns, are the usual, historical, quantifiable variation in a system, while "special causes" are unusual, not previously observed, non-quantifiable variation.
Recovery-oriented computing is a method constructed at Stanford University and the University of California, Berkeley for developing reliable Internet services. Its proponents seek to recognize computer bugs as inevitable, and then reduce their harmful effects. The National Science Foundation funds the project.
A Byzantine fault is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine Generals' Problem", developed to describe this condition, where actors must agree on a concerted strategy to avoid catastrophic system failure, but some of the actors are unreliable.
In the context of software engineering, software quality refers to two related but distinct notions:
Brian Randell is a British computer scientist, and Emeritus Professor at the School of Computing Science, Newcastle University, UK. He specialises in research into software fault tolerance and dependability, and is a noted authority on the early pre-1950 history of computers.
In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.
Reliability, availability and serviceability (RAS) is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.
In systems engineering and requirements engineering, a non-functional requirement (NFR) is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. They are contrasted with functional requirements that define specific behavior or functions. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture, because they are usually architecturally significant requirements.
High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
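An agreed level of uptime is often expressed as a percentage, which translates directly into a downtime budget. The sketch below shows that arithmetic, assuming a 365-day year; the function name and the example targets are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours(availability_percent):
    """Yearly downtime budget implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime allows {max_downtime_hours(target):.2f} h of downtime per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why higher targets tend to require redundancy rather than faster repair alone.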
In software testing, fault injection is a technique for improving the coverage of a test by introducing faults to test code paths, in particular error handling code paths, that might otherwise rarely be followed. It is often used with stress testing and is widely considered to be an important part of developing robust software. Robustness testing is a type of fault injection commonly used to test for vulnerabilities in communication interfaces such as protocols, command line parameters, or APIs.
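A minimal sketch of the technique in Python: the code under test receives its file opener as a parameter, so a test can substitute one that always fails and thereby force the error-handling path to run. The function names and the `"default"` fallback are hypothetical.

```python
import io


def read_config(opener, path):
    """Return the file contents, or a default when the file cannot be read."""
    try:
        with opener(path) as handle:
            return handle.read()
    except OSError:
        # Error-handling path that is rarely exercised in normal operation.
        return "default"


def failing_opener(path):
    """Injected fault: behaves like a disk or permission error."""
    raise OSError("injected fault")


def in_memory_opener(path):
    """Stand-in for a healthy file system."""
    return io.StringIO("mode=fast")


normal = read_config(in_memory_opener, "app.cfg")    # "mode=fast"
injected = read_config(failing_opener, "app.cfg")    # "default"
```

Passing the dependency in as an argument is one simple way to make fault injection possible without patching the standard library; dedicated fault injection tools operate at other levels, such as the network or the operating system.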
N-version programming (NVP), also known as multiversion programming or multiple-version dissimilar software, is a method or process in software engineering where multiple functionally equivalent programs are independently generated from the same initial specifications. The concept of N-version programming was introduced in 1977 by Liming Chen and Algirdas Avizienis with the central conjecture that the "independence of programming efforts will greatly reduce the probability of identical software faults occurring in two or more versions of the program". The aim of NVP is to improve the reliability of software operation by building in fault tolerance or redundancy.
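The voting step that typically accompanies N-version programming can be sketched as follows. This is an illustrative majority voter, not the scheme from the original 1977 work; the three integer square root "versions" are hypothetical, and one of them carries a deliberately planted fault.

```python
import math
from collections import Counter


def majority_vote(versions, *args):
    """Run every version and return the result that a strict majority agrees on."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            continue  # a crashed version simply contributes no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")


def isqrt_library(n):
    return math.isqrt(n)


def isqrt_search(n):
    root = 0
    while (root + 1) * (root + 1) <= n:
        root += 1
    return root


def isqrt_faulty(n):
    return n // 2  # planted fault: wrong for most inputs


answer = majority_vote([isqrt_library, isqrt_search, isqrt_faulty], 16)  # 4
```

The voter masks the faulty version only because the other two fail independently of it, which is exactly the conjecture that NVP relies on.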
Fault-tolerant computer systems are systems designed around the concepts of fault tolerance. In essence, they must be able to continue working to a level of satisfaction in the presence of errors or breakdowns.
ISO 26262, titled "Road vehicles – Functional safety", is an international standard for functional safety of electrical and/or electronic systems in production automobiles defined by the International Organization for Standardization (ISO) in 2011.
Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures.
Inspired by defensive disguise behaviors based on mimicry in biology, Cyber Mimic Defense (CMD) introduces a mechanism of dynamic multi-dimensional reconfiguration into the dissimilar redundancy structure (DRS), which is widely used in the field of reliability. It addresses certain or uncertain threats in cyberspace by the principle of uncertain defense, and varies and transforms the DRS elements inside an object in quantity or type, and in the time or space dimension, while keeping its externally visible functions unchanged.