Chaos engineering

Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. [1]

Concept

In software development, a given software system's ability to tolerate failures while still ensuring adequate quality of service—often generalized as resilience—is typically specified as a requirement. However, development teams often fail to meet this requirement due to factors such as short deadlines or lack of knowledge of the field. Chaos engineering is a technique to meet the resilience requirement.

Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures.

Operational readiness using chaos engineering

Calculating how much confidence we have in the interconnected complex systems that are put into production environments requires operational readiness metrics. Operational readiness can be evaluated using chaos engineering simulations, for example on Kubernetes infrastructure. Solutions for increasing a platform's resilience and operational readiness include strengthening its backup, restore, network file transfer, and failover capabilities, as well as the overall security of the environment. Gautam Siwach et al. evaluated inducing chaos in a Kubernetes environment by terminating random pods receiving data from edge devices in data centers while processing analytics on a big data network, and inferred the recovery time of the pods to calculate an estimated response time as a resilience metric. [2] [3]
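The shape of such a pod-termination experiment can be sketched in outline. The following is a minimal, self-contained Python simulation (all names and the fixed restart delay are hypothetical; a real experiment would terminate pods through the Kubernetes API and measure the scheduler's actual replacement latency):

```python
import random
import time

def run_chaos_trial(pods, restart_seconds):
    """Terminate one random pod and return the observed recovery time.

    `pods` maps pod names to a "Running"/"Terminated" status; `restart_seconds`
    stands in for the scheduler's replacement latency (simulated here).
    """
    victim = random.choice(list(pods))
    pods[victim] = "Terminated"          # induce the failure
    start = time.monotonic()
    time.sleep(restart_seconds)          # stand-in for waiting on the scheduler
    pods[victim] = "Running"             # replacement pod comes up
    return time.monotonic() - start

def resilience_metric(trials):
    """Mean recovery time across trials, used as a simple resilience score."""
    return sum(trials) / len(trials)

pods = {"edge-ingest-0": "Running", "analytics-1": "Running", "analytics-2": "Running"}
samples = [run_chaos_trial(pods, restart_seconds=0.01) for _ in range(5)]
print(f"estimated recovery time: {resilience_metric(samples):.3f}s")
```

Averaging over repeated trials, rather than relying on a single kill, is what turns the recovery time into a usable readiness metric.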

History

1983 – Apple

While MacWrite and MacPaint were being developed for the first Apple Macintosh computer, Steve Capps created "Monkey", a desk accessory which randomly generated user interface events at high speed, simulating a monkey frantically banging the keyboard and moving and clicking the mouse. It was promptly put to use for debugging by generating errors for programmers to fix, because automated testing was not possible; the first Macintosh had too little free memory space for anything more sophisticated. [4]

1992 – Prologue

While ABAL2 and SING were being developed for the first graphical versions of the PROLOGUE operating system, Iain James Marshall created "La Matraque", a desk accessory which generated random sequences of both valid and invalid graphical interface events at high speed, thus testing the critical edge behaviour of the underlying graphics libraries. The program would be run for days on end prior to production delivery, ensuring the required degree of resilience. It was subsequently extended to cover the database and other file-access instructions of the ABAL language, to check and ensure their resilience as well. A variation of this tool is currently employed for the qualification of the modern-day version, known as OPENABAL.

2003 – Amazon

While working to improve website reliability at Amazon, Jesse Robbins created "Game day", [5] an initiative that increases reliability by purposefully creating major failures on a regular basis. Robbins has said it was inspired by firefighter training and by research in other fields, including lessons from complex systems and reliability engineering. [6]

2006 – Google

While at Google, Kripa Krishnan created a similar program to Amazon's Game day (see above) called "DiRT". [6] [7] [8] Jason Cahoon, a Site Reliability Engineer [9] at Google, contributed a chapter on Google DiRT [10] in the "Chaos Engineering" book [11] and described the system at the GOTOpia 2021 conference. [12]

2011 – Netflix

While overseeing Netflix's migration to the cloud in 2011, Nora Jones, Casey Rosenthal, and Greg Orzell [11] [13] [14] expanded the discipline by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered inevitable, driving developers to treat built-in resilience as an obligation rather than an option:

"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services." [15]

By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.
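Why killing random instances verifies redundancy can be illustrated with a toy model. The sketch below uses hypothetical class and instance names, not Netflix's actual implementation; the point is that a load balancer over redundant instances keeps serving requests after one instance is disabled:

```python
import random

class ReplicatedService:
    """Toy model of a service running on several redundant instances."""
    def __init__(self, instances):
        self.healthy = set(instances)

    def handle_request(self):
        # A load balancer routes only to healthy instances.
        if not self.healthy:
            raise RuntimeError("total outage: no healthy instances")
        return f"served by {random.choice(sorted(self.healthy))}"

def chaos_monkey(service):
    """Disable one randomly chosen instance, as Chaos Monkey does in production."""
    victim = random.choice(sorted(service.healthy))
    service.healthy.discard(victim)
    return victim

service = ReplicatedService(["i-a", "i-b", "i-c"])
killed = chaos_monkey(service)
# With redundancy, requests still succeed after one instance is lost.
print(service.handle_request())
```

If the service had only one instance, the same experiment would surface the outage immediately, which is exactly the design flaw the practice is meant to expose before customers do.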

The concept of chaos engineering is close to that of Phoenix Servers, first introduced by Martin Fowler in 2012. [16]

Chaos engineering tools

Chaos Monkey

The logo for Chaos Monkey used by Netflix

Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. [13] It works by intentionally disabling computers in Netflix's production network to test how the remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.

The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license. [17] [18]

The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez: [19]

Imagine a monkey entering a data center, one of those "farms" of servers that host all the critical functions of our online activities. The monkey randomly rips out cables, destroys devices, and flings everything that passes within its reach. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, since no one ever knows when they will arrive or what they will destroy.

Simian Army

The Simian Army [18] is a suite of tools developed by Netflix to test the reliability, security, and resilience of its Amazon Web Services infrastructure. It includes the following tools: [20]

At the very top of the Simian Army hierarchy, Chaos Kong drops a full AWS "Region". [21] Though rare, loss of an entire region does happen, and Chaos Kong simulates a system's response and recovery to this type of event.

Chaos Gorilla drops a full Amazon "Availability Zone" (one or more entire data centers serving a geographical region). [22]
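A zone- or region-outage drill of this kind can be modeled in a few lines. The sketch below uses a hypothetical topology and routing function, not the actual Simian Army code, to show how traffic should fail over when Chaos Gorilla drops an availability zone or Chaos Kong drops a whole region:

```python
# Hypothetical topology: regions contain availability zones serving traffic.
topology = {
    "us-east-1": {"us-east-1a", "us-east-1b"},
    "us-west-2": {"us-west-2a", "us-west-2b"},
}

def chaos_gorilla(topology, region, zone):
    """Drop a single availability zone within a region."""
    topology[region].discard(zone)

def chaos_kong(topology, region):
    """Drop an entire region, forcing traffic to the survivors."""
    del topology[region]

def route(topology):
    """Pick any region that still has serving capacity; fail only on total loss."""
    for region, zones in topology.items():
        if zones:
            return region
    raise RuntimeError("no serving capacity left")

chaos_gorilla(topology, "us-east-1", "us-east-1a")
chaos_kong(topology, "us-east-1")
print(route(topology))  # traffic fails over to us-west-2
```

The drill passes only if `route` still finds capacity after the loss, which is the property a multi-region deployment is supposed to guarantee.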

Proofdock chaos engineering platform

Proofdock is a chaos engineering platform that focuses on and leverages the Microsoft Azure platform and the Azure DevOps services. Users can inject failures on the infrastructure, platform and application level. [23]

Gremlin

Gremlin is a "failure-as-a-service" platform. [24]

Facebook Storm

To prepare for the loss of a data center, Facebook regularly tests the resilience of its infrastructure to extreme events. Known as the Storm Project, the program simulates massive data center failures. [25]

Days of Chaos

Voyages-sncf.com created a "Day of Chaos" [26] in 2017, gamifying the simulation of pre-production failures. [27] They presented their results at the 2017 DevOps REX conference. [28]

Notes and references

  1. "Principles of Chaos Engineering". principlesofchaos.org. Retrieved 21 October 2017.
  2. Siwach, Gautam (29 November 2022). Evaluating operational readiness using chaos engineering simulations on Kubernetes architecture in Big Data (PDF). 2022 International Conference on Smart Applications, Communications and Networking (SmartNets). Botswana. pp. 1–7. Retrieved 3 January 2023.
  3. "Machine Learning Podcast Host and Technology Influencer: Gautam Siwach". LA Weekly. 7 October 2022.
  4. Hertzfeld, Andy. "Monkey Lives". Folklore. Retrieved 11 September 2023.
  5. "Game day". AWS Well-Architected Framework Glossary. Amazon. 31 December 2020. Retrieved 25 February 2024.
  6. Limoncelli, Tom (13 September 2012). "Resilience Engineering: Learning to Embrace Failure". ACM Queue. 10 (9) – via ACM.
  7. Krishnan, Kripa (16 September 2012). "Weathering the Unexpected". ACM Queue. 10 (9) – via ACM.
  8. Krishnan, Kripa (8–13 November 2015). 10 Years of Crashing Google. 2015 USENIX LISA. Washington, DC. Retrieved 25 February 2024.
  9. Beyer, Betsy; Jones, Chris (2016). Site Reliability Engineering (1st ed.). O'Reilly Media. ISBN 9781491929124. OCLC 1291707340.
  10. "Chapter 5. Google DiRT: Disaster Recovery Testing". Chaos Engineering book website. O'Reilly Media. 30 April 2020. Retrieved 25 February 2024.
  11. Jones, Nora; Rosenthal, Casey (2020). Chaos Engineering (1st ed.). O'Reilly Media. ISBN 9781492043867. OCLC 1143015464.
  12. Cahoon, Jason (2 June 2021). "WATCH: The DiRT on Chaos Engineering at Google" (video). youtube.com. GOTO Conferences.
  13. "The Netflix Simian Army". Netflix Tech Blog. Medium. 19 July 2011. Retrieved 21 October 2017.
  14. US 20120072571, Orzell, Gregory S. & Izrailevsky, Yury, "Validating the resiliency of networked applications", published 2012-03-22.
  15. "Netflix Chaos Monkey Upgraded". Netflix Tech Blog. Medium. 19 October 2016. Retrieved 21 October 2017.
  16. "PhoenixServer". martinfowler.com. Martin Fowler. 10 July 2012. Retrieved 14 January 2021.
  17. "Netflix libère Chaos Monkey dans la jungle Open Source" [Netflix releases Chaos Monkey into the open source jungle]. Le Monde Informatique (in French). Retrieved 7 November 2017.
  18. "SimianArmy: Tools for your cloud operating in top form. Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures". Netflix, Inc. 20 October 2017. Retrieved 21 October 2017.
  19. "Mais qui sont ces singes du chaos ?" [But who are these chaos monkeys?]. 15marches (in French). 25 July 2017. Retrieved 21 October 2017.
  20. SemiColonWeb (8 December 2015). "Infrastructure : quelles méthodes pour s'adapter aux nouvelles architectures Cloud ?" [Infrastructure: which methods to adapt to new cloud architectures?]. D2SI Blog (in French). Archived from the original on 21 October 2017. Retrieved 7 November 2017.
  21. "Chaos Engineering Upgraded". medium.com. 19 April 2017. Retrieved 10 April 2020.
  22. "The Netflix Simian Army". medium.com. Retrieved 12 December 2017.
  23. "A chaos engineering platform for Microsoft Azure". medium.com. 25 June 2020. Retrieved 28 June 2020.
  24. "Gremlin raises $18 million to expand 'failure-as-a-service' testing platform". VentureBeat. 28 September 2018. Retrieved 24 October 2018.
  25. Hof, Robert (11 September 2016). "Interview: How Facebook's Storm Heads Off Project Data Center Disasters". Forbes. Retrieved 21 October 2017.
  26. "Days of Chaos". Days of Chaos (in French). Retrieved 18 February 2022.
  27. "DevOps: feedback from Voyages-sncf.com". Moderator's Blog (in French). 17 March 2017. Retrieved 21 October 2017.
  28. devops REX (3 October 2017). "[devops REX 2017] Days of Chaos : le développement de la culture devops chez Voyages-Sncf.com à l'aide de la gamification" [Days of Chaos: developing a devops culture at Voyages-Sncf.com through gamification] (in French). Retrieved 18 February 2022.
