Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.
Concept
In software development, the ability of a given piece of software to tolerate failures while still ensuring adequate quality of service, often termed "resilience", is typically specified as a requirement. However, development teams may fail to meet this requirement due to factors such as short deadlines or lack of domain knowledge. Chaos engineering encompasses techniques aimed at meeting resilience requirements.
Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures.
Operational readiness using chaos engineering
Quantifying how much confidence can be placed in the interconnected, complex systems deployed to production environments requires operational readiness metrics. Operational readiness can be evaluated using chaos engineering simulations. Solutions for increasing the resilience and operational readiness of a platform include strengthening its backup, restore, network file transfer, and failover capabilities, as well as the overall security of the environment.
In one evaluation, chaos was induced in a Kubernetes environment by terminating random pods that were receiving data from edge devices in data centers while processing analytics on a big data network. The pods' recovery time served as a resiliency metric that estimated the response time.
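The recovery-time metric described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `SimulatedCluster`, its tick-based clock, and the pod names are hypothetical stand-ins, and no real Kubernetes API is called.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class SimulatedCluster:
    """Toy stand-in for a cluster (hypothetical, not a Kubernetes client)."""

    restart_ticks: dict[str, int]  # pod name -> ticks the pod needs to restart
    down_until: dict[str, int] = field(default_factory=dict)
    now: int = 0

    def terminate(self, pod: str) -> None:
        # The pod becomes ready again once its restart time has elapsed.
        self.down_until[pod] = self.now + self.restart_ticks[pod]

    def is_ready(self, pod: str) -> bool:
        return self.down_until.get(pod, 0) <= self.now

    def tick(self) -> None:
        self.now += 1


def measure_recovery(cluster: SimulatedCluster, pod: str, max_ticks: int = 100):
    """Terminate a pod and return the ticks until it is ready again.

    Returns None if the pod does not recover within max_ticks.
    """
    cluster.terminate(pod)
    start = cluster.now
    for _ in range(max_ticks):
        cluster.tick()
        if cluster.is_ready(pod):
            return cluster.now - start
    return None


def run_experiment(cluster: SimulatedCluster, rounds: int, rng: random.Random):
    """Terminate a randomly chosen pod each round and record its recovery time."""
    pods = sorted(cluster.restart_ticks)
    results = []
    for _ in range(rounds):
        victim = rng.choice(pods)
        results.append((victim, measure_recovery(cluster, victim)))
    return results


if __name__ == "__main__":
    cluster = SimulatedCluster({"ingest-a": 3, "ingest-b": 5})
    for pod, ticks in run_experiment(cluster, rounds=5, rng=random.Random(0)):
        print(pod, ticks)
```

In a real environment the tick loop would be replaced by polling the orchestrator for pod readiness, and the recorded durations would feed the operational readiness metric.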
History
1983 – Apple
While MacWrite and MacPaint were being developed for the first Apple Macintosh computer, Steve Capps created "Monkey", a desk accessory which randomly generated user interface events at high speed, simulating a monkey frantically banging the keyboard and moving and clicking the mouse. It was promptly put to use for debugging by generating errors for programmers to fix, because automated testing was not possible; the first Macintosh had too little free memory space for anything more sophisticated.
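The idea of feeding an application a stream of random user interface events can be sketched in a few lines. The example below is not the original "Monkey" desk accessory; `TinyEditor` and its three event types are hypothetical, chosen only to show the technique of random event generation combined with an invariant check.

```python
import random


class TinyEditor:
    """Hypothetical application under test: a text buffer with a cursor."""

    def __init__(self):
        self.text = []
        self.cursor = 0

    def key(self, ch):
        self.text.insert(self.cursor, ch)
        self.cursor += 1

    def backspace(self):
        if self.cursor > 0:
            self.cursor -= 1
            del self.text[self.cursor]

    def click(self, position):
        # Clicks outside the text are clamped to the valid cursor range.
        self.cursor = max(0, min(position, len(self.text)))

    def check_invariants(self):
        assert 0 <= self.cursor <= len(self.text), "cursor out of bounds"


def monkey(app, n_events, seed):
    """Send n_events random events to app, checking invariants after each one.

    The seed makes the run reproducible, so a failing sequence can be replayed.
    """
    rng = random.Random(seed)
    log = []
    for _ in range(n_events):
        kind = rng.choice(["key", "backspace", "click"])
        if kind == "key":
            event = ("key", rng.choice("abc "))
            app.key(event[1])
        elif kind == "backspace":
            event = ("backspace",)
            app.backspace()
        else:
            # Deliberately includes out-of-range positions.
            event = ("click", rng.randint(-5, 50))
            app.click(event[1])
        log.append(event)
        app.check_invariants()
    return log


if __name__ == "__main__":
    events = monkey(TinyEditor(), n_events=1000, seed=42)
    print(len(events), "events survived")
```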
1992 – Prologue
While ABAL2 and SING were being developed for the first graphical versions of the PROLOGUE operating system, Iain James Marshall created "La Matraque", a desk accessory which generated random sequences of both legal and invalid graphical interface events at high speed, thus testing the critical edge behaviour of the underlying graphics libraries. This program would be launched prior to production delivery and left running for days on end, thus ensuring the required degree of resilience. The tool was subsequently extended to cover the database and other file access instructions of the ABAL language, to check and ensure their resilience as well. A variation of this tool is currently employed for the qualification of the modern-day version known as OPENABAL.
2003 – Amazon
While working to improve website reliability at Amazon, Jesse Robbins created "Game day", an initiative that increases reliability by purposefully creating major failures on a regular basis. Robbins has said it was inspired by firefighter training and by lessons from research in other fields, such as complex systems and reliability engineering.
2006 – Google
While at Google, Kripa Krishnan created a program similar to Amazon's Game day (see above) called "DiRT" (Disaster Recovery Testing). Jason Cahoon, a Site Reliability Engineer at Google, contributed a chapter on Google DiRT in the "Chaos Engineering" book and described the system at the GOTOpia 2021 conference.
2011 – Netflix
While overseeing Netflix's migration to the cloud in 2011, Nora Jones, Casey Rosenthal, and Greg Orzell expanded the discipline by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered to be inevitable, driving developers to consider built-in resilience to be an obligation rather than an option:
"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."
By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.
The concept of chaos engineering is close to that of Phoenix Servers, first introduced by Martin Fowler in 2012.
Chaos engineering tools
Chaos Monkey
Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure.
It works by intentionally disabling computers in Netflix's production network to test how the remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.
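The mechanism described above, disabling a randomly chosen machine and observing how the remaining ones cope, can be sketched as follows. This is not Chaos Monkey's code: `Instance`, `RoundRobinBalancer`, and the request format are hypothetical, and the whole fleet lives in memory.

```python
import random


class Instance:
    """Hypothetical service instance that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class RoundRobinBalancer:
    """Tries each instance in turn and fails over when one is down."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0

    def send(self, request):
        for _ in range(len(self.instances)):
            instance = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next instance
        raise RuntimeError("no healthy instance available")


def disable_random_instance(instances, rng):
    """Pick one live instance at random and disable it."""
    victim = rng.choice([i for i in instances if i.alive])
    victim.alive = False
    return victim


def success_rate(balancer, n_requests):
    """Fraction of requests that were served by some instance."""
    ok = 0
    for i in range(n_requests):
        try:
            balancer.send(f"req-{i}")
            ok += 1
        except RuntimeError:
            pass
    return ok / n_requests


if __name__ == "__main__":
    fleet = [Instance(f"i-{k}") for k in range(3)]
    balancer = RoundRobinBalancer(fleet)
    victim = disable_random_instance(fleet, random.Random(1))
    print("disabled", victim.name, "success rate", success_rate(balancer, 50))
```

With three redundant instances, losing one leaves the success rate unchanged; the experiment only reports a drop once no healthy instance remains, which is the kind of weakness such a tool is meant to expose.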
The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license.
The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:
Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.
Simian Army
The Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resilience of its Amazon Web Services infrastructure. It includes the following tools:
* At the very top of the Simian Army hierarchy, Chaos Kong drops a full AWS "Region". Though rare, loss of an entire region does happen, and Chaos Kong simulates a system's response and recovery to this type of event.
* Chaos Gorilla drops a full Amazon "Availability Zone" (one or more entire data centers serving a geographical region).
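The tools above differ mainly in the scope of the failure they inject: a single machine, an Availability Zone, or a whole Region. The sketch below models those three scopes over an in-memory fleet. The `Server` type, the `drop` function, and the region and zone names are hypothetical and do not correspond to any AWS or Netflix API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    region: str
    zone: str  # assumed unique across regions, e.g. "region-1a"


def example_fleet():
    """Build a hypothetical fleet: 2 regions x 2 zones x 2 servers."""
    fleet = []
    for r in (1, 2):
        for z in ("a", "b"):
            for n in (1, 2):
                region = f"region-{r}"
                zone = f"{region}{z}"
                fleet.append(Server(f"{zone}-srv{n}", region, zone))
    return fleet


def drop(servers, *, name=None, zone=None, region=None):
    """Return the servers that survive a failure of the given scope.

    Exactly one of name, zone, or region is expected; the narrowest
    scope supplied takes precedence.
    """
    def is_hit(server):
        if name is not None:
            return server.name == name
        if zone is not None:
            return server.zone == zone
        if region is not None:
            return server.region == region
        return False

    return [s for s in servers if not is_hit(s)]


if __name__ == "__main__":
    fleet = example_fleet()
    print("single server lost:", len(drop(fleet, name=fleet[0].name)), "left")
    print("zone lost:", len(drop(fleet, zone="region-1a")), "left")
    print("region lost:", len(drop(fleet, region="region-1")), "left")
```

A resilience check would then ask whether the surviving servers can still carry the expected load for each scope.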
Other
Voyages-sncf.com's 2017 "Day of Chaos" gamified the simulation of pre-production failures, and was presented at the 2017 DevOps REX conference. Founded in 2019, Steadybit popularized pre-production chaos and reliability engineering.
Its open-source Reliability Hub extends Steadybit.
Proofdock can inject infrastructure, platform, and application failures on Microsoft Azure DevOps.
Gremlin is a "failure-as-a-service" platform.
Facebook's Project Storm simulates datacenter failures for natural disaster resistance.
See also
* Data redundancy
* Error detection and correction
* Fail-fast system
* Fail fast (business), a related subject in business management
* Fall back and forward
* Fault injection
* Fault tolerance
* Fault-tolerant computer system
* Grease (networking)
* Resilience (network)
* Robustness (computer science)
* Fuzzing
External links
* Principles of Chaos Engineering – The Chaos Engineering manifesto
* Chaos Engineering – Adrian Hornsby
* How Chaos Engineering Practices Will Help You Design Better Software – Mariano Calandra