Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. [1]
In software development, the ability of software to tolerate failures while still ensuring adequate quality of service (often termed resilience) is typically specified as a requirement. However, development teams may fail to meet this requirement due to factors such as short deadlines or lack of domain knowledge. Chaos engineering encompasses techniques aimed at meeting resilience requirements.
Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures.
Gauging confidence in the interconnected, complex systems that are put into production requires operational readiness metrics, and operational readiness can be evaluated using chaos engineering simulations. Solutions for increasing a platform's resilience and operational readiness include strengthening its backup, restore, network file transfer, and failover capabilities, as well as the overall security of the environment.
One evaluation induced chaos in a Kubernetes environment by terminating random pods that were receiving data from edge devices in data centers while processing analytics on a big data network; the pods' recovery time served as a resiliency metric for estimating response time. [2] [3]
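As a rough illustration of such an experiment, the sketch below uses the official Kubernetes Python client to delete a randomly chosen pod and time the namespace's return to readiness. The `analytics` namespace and the all-pods-Ready recovery criterion are illustrative assumptions, not details from the cited study.

```python
import random
import time

from kubernetes import client, config  # official Kubernetes Python client

NAMESPACE = "analytics"  # hypothetical namespace for the data-processing pods


def all_pods_ready(v1: client.CoreV1Api) -> bool:
    """Return True once every pod in the namespace reports a Ready condition."""
    for pod in v1.list_namespaced_pod(NAMESPACE).items:
        conditions = pod.status.conditions or []
        if not any(c.type == "Ready" and c.status == "True" for c in conditions):
            return False
    return True


def kill_random_pod_and_time_recovery() -> float:
    """Delete one randomly chosen pod and measure how long the namespace
    takes to return to an all-Ready state (the recovery-time metric)."""
    config.load_kube_config()  # assumes a kubeconfig with cluster access
    v1 = client.CoreV1Api()

    victim = random.choice(v1.list_namespaced_pod(NAMESPACE).items)
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
    start = time.monotonic()

    # Wait for the victim to disappear, then for full readiness to return.
    while any(p.metadata.name == victim.metadata.name
              for p in v1.list_namespaced_pod(NAMESPACE).items):
        time.sleep(1)
    while not all_pods_ready(v1):
        time.sleep(1)
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"recovery time: {kill_random_pod_and_time_recovery():.1f}s")
```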
1983 – Apple
While MacWrite and MacPaint were being developed for the first Apple Macintosh computer, Steve Capps created "Monkey", a desk accessory which randomly generated user interface events at high speed, simulating a monkey frantically banging the keyboard and moving and clicking the mouse. It was promptly put to use for debugging by generating errors for programmers to fix, because automated testing was not possible; the first Macintosh had too little free memory space for anything more sophisticated. [4]
1992 – Prologue
While ABAL2 and SING were being developed for the first graphical versions of the PROLOGUE operating system, Iain James Marshall created "La Matraque", a desk accessory which rapidly generated random sequences of both valid and invalid graphical interface events, thus testing the critical edge behaviour of the underlying graphics libraries. This program would be launched prior to production delivery, for days on end, ensuring the required degree of resilience. The tool was subsequently extended to cover the database and other file-access instructions of the ABAL language, to check and ensure their resilience as well. A variation of this tool is currently employed for the qualification of the modern-day version, known as OPENABAL.
2003 – Amazon
While working to improve website reliability at Amazon, Jesse Robbins created "Game day", [5] an initiative that increases reliability by purposefully creating major failures on a regular basis. Robbins has said it was inspired by firefighter training and by research on complex systems and reliability engineering in other fields. [6]
2006 – Google
While at Google, Kripa Krishnan created a similar program to Amazon's Game day (see above) called "DiRT". [6] [7] [8] Jason Cahoon, a Site Reliability Engineer [9] at Google, contributed a chapter on Google DiRT [10] in the "Chaos Engineering" book [11] and described the system at the GOTOpia 2021 conference. [12]
2011 – Netflix
While overseeing Netflix's migration to the cloud in 2011, Nora Jones, Casey Rosenthal, and Greg Orzell [11] [13] [14] expanded the discipline by setting up a tool that would cause breakdowns in the production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to one in which breakdowns were considered inevitable, driving developers to consider built-in resilience an obligation rather than an option:
"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services." [15]
By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.
The concept of chaos engineering is close to that of Phoenix Servers, first introduced by Martin Fowler in 2012. [16]
Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. [13] It works by intentionally disabling computers in Netflix's production network to test how the remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.
The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license. [17] [18]
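The core behavior can be sketched in a few lines. The following illustration uses boto3 to terminate one randomly chosen EC2 instance that has opted in through a hypothetical `chaos-target` tag; it is a minimal sketch of the idea, not Netflix's released implementation.

```python
import random

import boto3  # AWS SDK for Python; assumes credentials are already configured

ec2 = boto3.client("ec2")


def terminate_random_tagged_instance() -> str | None:
    """Terminate one randomly chosen running instance that carries a
    (hypothetical) chaos-target opt-in tag, in the spirit of Chaos Monkey."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag-key", "Values": ["chaos-target"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    candidates = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if not candidates:
        return None  # nothing opted in; do no harm
    victim = random.choice(candidates)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim


if __name__ == "__main__":
    print("terminated:", terminate_random_tagged_instance())
```

Running such a script on a schedule during business hours, as Chaos Monkey does, turns instance failure from a rare surprise into a routine event that redundancy and automation must absorb.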
The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez: [19]
Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.
The Simian Army [18] is a suite of tools developed by Netflix to test the reliability, security, and resilience of its Amazon Web Services infrastructure. Alongside Chaos Monkey, it includes tools such as Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, and Chaos Gorilla. [20]
In 2017, Voyages-sncf.com ran a "Day of Chaos", [23] gamifying the simulation of pre-production failures; [24] the results were presented at the 2017 DevOps REX conference. [25] Founded in 2019, Steadybit popularized pre-production chaos and reliability engineering, [26] and its open-source Reliability Hub extends the platform. [27] [28]
Proofdock can inject infrastructure, platform, and application failures on Microsoft Azure DevOps. [26] Gremlin is a "failure-as-a-service" platform. [29] Facebook's Project Storm simulates large-scale data center failures to build resistance to natural disasters. [30]
In software quality assurance, performance testing is a testing practice performed to determine how a system performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate, measure, validate, or verify other quality attributes of the system, such as scalability, reliability, and resource usage.
In database systems, durability is the ACID property that guarantees that the effects of transactions that have been committed will survive permanently, even in cases of failures, including incidents and catastrophic events. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes.
The outline of software engineering provides an overview of and topical guide to software engineering.
Load testing is the process of putting demand on a structure or system and measuring its response.
IT disaster recovery (also, simply disaster recovery (DR)) is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. DR employs policies, tools, and procedures with a focus on IT systems supporting critical business functions. This involves keeping all essential aspects of a business functioning despite significant disruptive events; it can therefore be considered a subset of business continuity (BC). DR assumes that the primary site is not immediately recoverable and restores data and services to a secondary site.
In programming and software development, fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program. The program is then monitored for exceptions such as crashes, failing built-in code assertions, or potential memory leaks. Typically, fuzzers are used to test programs that take structured inputs. This structure is specified, e.g., in a file format or protocol and distinguishes valid from invalid input. An effective fuzzer generates semi-valid inputs that are "valid enough" in that they are not directly rejected by the parser, but do create unexpected behaviors deeper in the program and are "invalid enough" to expose corner cases that have not been properly dealt with.
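As a toy illustration of the semi-valid-input idea, the sketch below mutates a known-valid JSON document a few bytes at a time and feeds it to Python's JSON parser; clean rejections are expected, while any other exception is reported. Real fuzzers such as AFL or libFuzzer are far more sophisticated.

```python
import json
import random

SEED_INPUT = b'{"flight": "LX318", "seats": 2}'  # a known-valid document


def mutate(data: bytes, n_mutations: int = 3) -> bytes:
    """Flip a few random bytes, producing a semi-valid variant of the seed."""
    buf = bytearray(data)
    for _ in range(n_mutations):
        buf[random.randrange(len(buf))] = random.randrange(256)
    return bytes(buf)


def fuzz(target, rounds: int = 10_000) -> None:
    """Drive the target with mutated inputs; clean rejections are expected,
    while any other exception is flagged as a finding."""
    for i in range(rounds):
        sample = mutate(SEED_INPUT)
        try:
            target(sample)
        except ValueError:
            pass  # expected rejection (parse or decode error)
        except Exception as exc:
            print(f"round {i}: {type(exc).__name__} on {sample!r}")


if __name__ == "__main__":
    fuzz(json.loads)
```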
Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability is defined as the probability that a product, system, or service will perform its intended function adequately for a specified period of time, or will operate in a defined environment without failure. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.
Software assurance (SwA) is a critical process in software development that ensures the reliability, safety, and security of software products. It involves a variety of activities, including requirements analysis, design reviews, code inspections, testing, and formal verification. One crucial component of software assurance is secure coding practices, which follow industry-accepted standards and best practices, such as those outlined by the Software Engineering Institute (SEI) in their CERT Secure Coding Standards (SCS).
High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.
Urs Hölzle is a Swiss-American software engineer and technology executive. As Google's eighth employee and its first VP of Engineering, he has shaped much of Google's development processes and infrastructure, as well as its engineering culture. His most notable contributions include leading the development of fundamental cloud infrastructure such as energy-efficient data centers, distributed compute and storage systems, and software-defined networking. Until July 2023, he was the Senior Vice President of Technical Infrastructure and Google Fellow at Google. In July 2023, he transitioned to being a Google Fellow only.
In computer science, fault injection is a testing technique for understanding how computing systems behave when stressed in unusual ways. This can be achieved using physical- or software-based means, or using a hybrid approach. Widely studied physical fault injections include the application of high voltages, extreme temperatures and electromagnetic pulses on electronic components, such as computer memory and central processing units. By exposing components to conditions beyond their intended operating limits, computing systems can be coerced into mis-executing instructions and corrupting critical data.
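On the software side, one simple form of fault injection wraps a function boundary and injects failures with a configurable probability, to probe the caller's error handling. The decorator below is an illustrative sketch; the `fetch_balance` target and the injected `TimeoutError` are assumptions, not a standard facility.

```python
import functools
import random
import time


def inject_faults(error_rate: float = 0.1, max_delay: float = 2.0):
    """Decorator that randomly raises an error, or adds latency, when the
    wrapped function is called, to exercise the caller's error handling."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise TimeoutError("injected fault")       # simulated failure
            if roll < 2 * error_rate:
                time.sleep(random.uniform(0, max_delay))   # simulated slowness
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(error_rate=0.2)
def fetch_balance(account_id: str) -> int:
    """Stand-in for a remote call that the system under test depends on."""
    return 42


if __name__ == "__main__":
    for _ in range(10):
        try:
            print(fetch_balance("acct-1"))
        except TimeoutError as exc:
            print("caller saw:", exc)
```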
Stress testing is a software testing activity that determines the robustness of software by testing beyond the limits of normal operation. Stress testing is particularly important for "mission critical" software, but is used for all types of software. Stress tests commonly put a greater emphasis on robustness, availability, and error handling under a heavy load, than on what would be considered correct behavior under normal circumstances.
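A minimal sketch of that emphasis, assuming an in-process stand-in for the system under test: many concurrent requests are issued, and the error rate under load, rather than nominal correctness, is what gets measured.

```python
import concurrent.futures


def flaky_service(n: int) -> int:
    """Stand-in for the system under test; a real stress test would target
    a deployed service rather than an in-process function."""
    if n % 97 == 0:
        raise RuntimeError("resource exhausted")  # simulated failure under load
    return n * n


def stress(workers: int = 64, requests: int = 10_000) -> None:
    """Drive the target well beyond normal load and tally the errors,
    emphasizing robustness and error handling over nominal behavior."""
    errors = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(flaky_service, i) for i in range(requests)]
        for future in futures:
            try:
                future.result()
            except RuntimeError:
                errors += 1
    print(f"{requests} requests, {errors} errors ({errors / requests:.2%})")


if __name__ == "__main__":
    stress()
```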
Amazon Elastic Compute Cloud (EC2) is a part of Amazon's cloud-computing platform, Amazon Web Services (AWS), that allows users to rent virtual computers on which to run their own computer applications. EC2 encourages scalable deployment of applications by providing a web service through which a user can boot an Amazon Machine Image (AMI) to configure a virtual machine, which Amazon calls an "instance", containing any software desired. A user can create, launch, and terminate server-instances as needed, paying by the second for active servers – hence the term "elastic". EC2 provides users with control over the geographical location of instances that allows for latency optimization and high levels of redundancy. In November 2010, Amazon switched its own retail website platform to EC2 and AWS.
Christopher Arthur Lattner is an American software engineer and creator of LLVM, the Clang compiler, the Swift programming language and the MLIR compiler infrastructure.
Eucalyptus is a paid and open-source computer software for building Amazon Web Services (AWS)-compatible private and hybrid cloud computing environments, originally developed by the company Eucalyptus Systems. Eucalyptus is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems. Eucalyptus enables pooling compute, storage, and network resources that can be dynamically scaled up or down as application workloads change. Mårten Mickos was the CEO of Eucalyptus. In September 2014, Eucalyptus was acquired by Hewlett-Packard and then maintained by DXC Technology. After DXC stopped developing the product in late 2017, AppScale Systems forked the code and started supporting Eucalyptus customers.
DevOps is a methodology in the software development and IT industry. Used as a set of practices and tools, DevOps integrates and automates the work of software development (Dev) and IT operations (Ops) as a means for improving and shortening the systems development life cycle. DevOps is complementary to agile software development; several DevOps aspects came from the agile way of working.
Jesse Robbins is an American technology entrepreneur, investor, and firefighter notable for his pioneering work in Cloud computing, role in creating DevOps/Chaos Engineering, and efforts to improve emergency management.
Autoscaling, also spelled auto scaling or auto-scaling, and sometimes also called automatic scaling, is a method used in cloud computing that dynamically adjusts the amount of computational resources in a server farm - typically measured by the number of active servers - automatically based on the load on the farm. For example, the number of servers running behind a web application may be increased or decreased automatically based on the number of active users on the site. Since such metrics may change dramatically throughout the course of the day, and servers are a limited resource that cost money to run even while idle, there is often an incentive to run "just enough" servers to support the current load while still being able to support sudden and large spikes in activity. Autoscaling is helpful for such needs, as it can reduce the number of active servers when activity is low, and launch new servers when activity is high. Autoscaling is closely related to, and builds upon, the idea of load balancing.
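The decision loop at the heart of threshold-based autoscaling can be sketched briefly; `current_utilization` and the thresholds below are hypothetical stand-ins for a cloud provider's monitoring metrics and policy settings.

```python
import random
import time

MIN_SERVERS, MAX_SERVERS = 2, 20
SCALE_UP_AT, SCALE_DOWN_AT = 0.75, 0.25  # average-utilization thresholds


def current_utilization() -> float:
    """Hypothetical metric source; a real autoscaler would query the
    cloud provider's monitoring API instead."""
    return random.uniform(0.0, 1.0)


def autoscale_step(servers: int) -> int:
    """One iteration of a threshold-based scaling decision."""
    load = current_utilization()
    if load > SCALE_UP_AT and servers < MAX_SERVERS:
        servers += 1   # launch a server to absorb the spike
    elif load < SCALE_DOWN_AT and servers > MIN_SERVERS:
        servers -= 1   # retire an idle server to save cost
    print(f"utilization={load:.2f} servers={servers}")
    return servers


if __name__ == "__main__":
    n = MIN_SERVERS
    for _ in range(10):
        n = autoscale_step(n)
        time.sleep(0.1)
```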
Site Reliability Engineering (SRE) is a discipline encompassing principles and practices that integrate software engineering with IT infrastructure and operations to enhance system reliability. SRE shares some similarities with DevOps, which focuses on software development and operational practices.
This is a timeline of Amazon Web Services, which offers a suite of cloud computing services that make up an on-demand computing platform.