Fault tolerance is the property that enables a
system
A system is a group of Interaction, interacting or interrelated elements that act according to a set of rules to form a unified whole. A system, surrounded and influenced by its environment (systems), environment, is described by its boundaries, ...
to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in
high-availability
High availability (HA) is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
Modernization has resulted in an increased reliance on these systems. Fo ...
,
mission-critical
A mission critical factor of a system is any factor (component, equipment, personnel, process, procedure, software, etc.) that is essential to business operation or to an organization. Failure or disruption of mission critical factors will resu ...
, or even
life-critical system
A safety-critical system (SCS) or life-critical system is a system whose failure or malfunction may result in one (or more) of the following outcomes:
* death or serious injury to people
* loss or severe damage to equipment/property
* environme ...
s. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation.
A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely, when some part of the system
fails. The term is most commonly used to describe
computer system
A computer is a machine that can be programmed to carry out sequences of arithmetic or logical operations (computation) automatically. Modern digital electronic computers can perform generic sets of operations known as programs. These progr ...
s designed to continue more or less fully operational with, perhaps, a reduction in
throughput
Network throughput (or just throughput, when in context) refers to the rate of message delivery over a communication channel, such as Ethernet or packet radio, in a communication network. The data that these messages contain may be delivered ov ...
or an increase in
response time
Response time may refer to:
*The time lag between an electronic input and the output signal which depends upon the value of passive components used.
* Responsiveness, how quickly an interactive system responds to user input
* Response time (biolog ...
in the event of some partial failure. That is, the system as a whole is not stopped due to problems either in the
hardware or the
software
Software is a set of computer programs and associated documentation and data. This is in contrast to hardware, from which the system is built and which actually performs the work.
At the lowest programming level, executable code consists ...
. An example in another field is a motor vehicle designed so it will continue to be drivable if one of the tires is punctured, or a structure that is able to retain its integrity in the presence of damage due to causes such as
fatigue
Fatigue describes a state of tiredness that does not resolve with rest or sleep. In general usage, fatigue is synonymous with extreme tiredness or exhaustion that normally follows prolonged physical or mental activity. When it does not resolve ...
,
corrosion
Corrosion is a natural process that converts a refined metal into a more chemically stable oxide. It is the gradual deterioration of materials (usually a metal) by chemical or electrochemical reaction with their environment. Corrosion engine ...
, manufacturing flaws, or impact.
Within the scope of an ''individual'' system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, aiming for
self-stabilization
Self-stabilization is a concept of fault-tolerance in distributed systems. Given any initial state, a self-stabilizing distributed system will end up in a correct state in a finite number of execution steps.
At first glance, the guarantee of self ...
so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequence of a system failure is so catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.
History
The first known fault-tolerant computer was
SAPO, built in 1951 in
Czechoslovakia
, rue, Чеськословеньско, , yi, טשעכאסלאוואקיי,
, common_name = Czechoslovakia
, life_span = 1918–19391945–1992
, p1 = Austria-Hungary
, image_p1 ...
by
Antonín Svoboda.
Its basic design was
magnetic drums connected via relays, with a voting method of
memory error detection (
triple modular redundancy
Triple is used in several contexts to mean "threefold" or a " treble":
Sports
* Triple (baseball), a three-base hit
* A basketball three-point field goal
* A figure skating jump with three rotations
* In bowling terms, three strikes in a row
* I ...
). Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories: machines that would last a long time without any maintenance, such as the ones used on
NASA
The National Aeronautics and Space Administration (NASA ) is an independent agency of the US federal government responsible for the civil space program, aeronautics research, and space research.
NASA was established in 1958, succeeding t ...
space probe
A space probe is an artificial satellite that travels through space to collect scientific data. A space probe may orbit Earth; approach the Moon; travel through interplanetary space; flyby, orbit, or land or fly on other planetary bodies; or ent ...
s and
satellite
A satellite or artificial satellite is an object intentionally placed into orbit in outer space. Except for passive satellites, most satellites have an electricity generation system for equipment on board, such as solar panels or radioisotope ...
s; computers that were very dependable but required constant monitoring, such as those used to monitor and control
nuclear power plant
A nuclear power plant (NPP) is a thermal power station in which the heat source is a nuclear reactor. As is typical of thermal power stations, heat is used to generate steam that drives a steam turbine connected to a electric generator, generato ...
s or
supercollider experiments; and finally, computers with a high amount of runtime which would be under heavy use, such as many of the supercomputers used by
insurance companies
Insurance is a means of protection from financial loss in which, in exchange for a fee, a party agrees to compensate another party in the event of a certain loss, damage, or injury. It is a form of risk management, primarily used to hedge ...
for their
probability
Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...
monitoring.
Most of the development in the so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, in preparation for
Project Apollo and other research aspects. NASA's first machine went into a
space observatory
A space telescope or space observatory is a telescope in outer space used to observe astronomical objects. Suggested by Lyman Spitzer in 1946, the first operational telescopes were the American Orbiting Astronomical Observatory, OAO-2 launched ...
, and their second attempt, the JSTAR computer, was used in
Voyager. This computer had a backup of memory arrays to use memory recovery methods and thus it was called the JPL Self-Testing-And-Repairing computer. It could detect its own errors and fix them or bring up redundant modules as needed. The computer is still working, as of early 2022.
Hyper-dependable computers were pioneered mostly by
aircraft
An aircraft is a vehicle that is able to fly by gaining support from the air. It counters the force of gravity by using either static lift or by using the dynamic lift of an airfoil, or in a few cases the downward thrust from jet engines ...
manufacturers,
nuclear power
Nuclear power is the use of nuclear reactions to produce electricity. Nuclear power can be obtained from nuclear fission, nuclear decay and nuclear fusion reactions. Presently, the vast majority of electricity from nuclear power is produced b ...
companies, and the
railroad industry in the USA. These needed computers with massive amounts of uptime that would
fail gracefully
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the ...
enough with a fault to allow continued operation while relying on the fact that the computer output would be constantly monitored by humans to detect faults. Again, IBM developed the first computer of this kind for NASA for guidance of
Saturn V
Saturn V is a retired American super heavy-lift launch vehicle developed by NASA under the Apollo program for human exploration of the Moon. The rocket was human-rated, with multistage rocket, three stages, and powered with liquid-propellant r ...
rockets, but later on
BNSF
BNSF Railway is one of the largest freight railroads in North America. One of seven North American Class I railroads, BNSF has 35,000 employees, of track in 28 states, and nearly 8,000 locomotives. It has three transcontinental routes that ...
,
Unisys
Unisys Corporation is an American multinational information technology (IT) services and consulting company headquartered in Blue Bell, Pennsylvania. It provides digital workplace solutions, cloud, applications, and infrastructure solutions, e ...
, and
General Electric
General Electric Company (GE) is an American multinational conglomerate founded in 1892, and incorporated in New York state and headquartered in Boston. The company operated in sectors including healthcare, aviation, power, renewable energ ...
built their own.
In the 1970s, much work has happened in the field
. For instance,
F14 CADC
The F-14's Central Air Data Computer, also abbreviated as CADC, computes altitude, vertical speed, air speed, and mach number from sensor inputs such as pitot and static pressure and temperature. Earlier air data computer systems were electromech ...
had
built-in self-test
A built-in self-test (BIST) or built-in test (BIT) is a mechanism that permits a machine to test itself. Engineers design BISTs to meet requirements such as:
*high reliability
*lower repair cycle times
or constraints such as:
*limited technic ...
and redundancy.
In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. Later efforts showed that to be fully effective, the system had to be self-repairing and diagnosing – isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
Voting was another initial method, as discussed above, with multiple redundant backups operating constantly and checking each other's results, with the outcome that if, for example, four components reported an answer of 5 and one component reported an answer of 6, the other four would "vote" that the fifth component was faulty and have it taken out of service. This is called M out of N majority voting.
Historically, the motion has always been to move further from N-model and more to M out of N due to the fact that the complexity of systems and the difficulty of ensuring the transitive state from fault-negative to fault-positive did not disrupt operations.
Tandem
Tandem, or in tandem, is an arrangement in which a team of machines, animals or people are lined up one behind another, all facing in the same direction.
The original use of the term in English was in ''tandem harness'', which is used for two ...
and
Stratus
Stratus may refer to:
Weather
*Stratus cloud, a cloud type
**Nimbostratus cloud, a cloud type
**Stratocumulus cloud, a cloud type
**Altostratus cloud, a cloud type
**Altostratus undulatus cloud, a cloud type
**Cirrostratus cloud, a cloud type
Mus ...
were among the first companies specializing in the design of fault-tolerant computer systems for
online transaction processing In online transaction processing (OLTP), information systems typically facilitate and manage transaction-oriented applications. This is contrasted with online analytical processing.
The term "transaction" can have two different meanings, both of wh ...
.
Examples
Hardware fault tolerance sometimes requires that broken parts be taken out and replaced with new parts while the system is still operational (in computing known as ''
hot swapping
Hot swapping is the replacement or addition of components to a computer system without stopping, shutting down, or rebooting the system; hot plugging describes the addition of components only. Components which have such functionality are said ...
''). Such a system implemented with a single backup is known as single point tolerant and represents the vast majority of fault-tolerant systems. In such systems the
mean time between failure
Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a syste ...
s should be long enough for the operators to have sufficient time to fix the broken devices (
mean time to repair
Mean time to repair (MTTR) is a basic measure of the maintainability of repairable items. It represents the average time required to repair a failed component or device. Expressed mathematically, it is the total corrective maintenance time fo ...
) before the backup also fails. It is helpful if the time between failures is as long as possible, but this is not specifically required in a fault-tolerant system.
Fault tolerance is notably successful in computer applications.
Tandem Computers
Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum up ...
built their entire business on such machines, which used single-point tolerance to create their
NonStop systems with
uptime
Uptime is a measure of system reliability, expressed as the percentage of time a machine, typically a computer, has been working and available. Uptime is the opposite of downtime.
It is often used as a measure of computer operating system reliab ...
s measured in years.
Fail-safe
In engineering, a fail-safe is a design feature or practice that in the event of a specific type of failure, inherently responds in a way that will cause minimal or no harm to other equipment, to the environment or to people. Unlike inherent safe ...
architectures may encompass also the computer software, for example by process
replication.
Data formats may also be designed to degrade gracefully.
HTML
The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
for example, is designed to be
forward compatible
Forward compatibility or upward compatibility is a design characteristic that allows a system to accept input intended for a later version of itself. The concept can be applied to entire systems, electrical interface
Interface or interfacing may ...
, allowing
Web browser
A web browser is application software for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's screen. Browsers are used on ...
s to ignore new and unsupported HTML entities without causing the document to be unusable. Additionally, some sites, including popular platforms such as Twitter (until December 2020), provide an optional lightweight front end that does not rely on
JavaScript
JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of Website, websites use JavaScript on the Client (computing), client side ...
and has a
minimal layout, to ensure wide
accessibility
Accessibility is the design of products, devices, services, vehicles, or environments so as to be usable by people with disabilities. The concept of accessible design and practice of accessible development ensures both "direct access" (i. ...
and
outreach
Outreach is the activity of providing services to any population that might not otherwise have access to those services. A key component of outreach is that the group providing it is not stationary, but mobile; in other words, it involves meetin ...
, such as on
game console
A video game console is an electronic device that outputs a video signal or image to display a video game that can be played with a game controller. These may be home consoles, which are generally placed in a permanent location connected to a t ...
s with limited web browsing capabilities.
Terminology
A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails.
A system that is designed to
fail safe
In engineering, a fail-safe is a design feature or practice that in the event of a specific type of failure, inherently responds in a way that will cause minimal or no harm to other equipment, to the environment or to people. Unlike inherent safet ...
, or fail-secure, or fail gracefully, whether it functions at a reduced level or fails completely, does so in a way that protects people, property, or data from injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a
graceful exit
A graceful exit (or graceful handling) is a simple programming idiom wherein a program detects a serious error condition and "exits gracefully" in a controlled manner as a result. Often the program prints a descriptive error message to a terminal o ...
(as opposed to an uncontrolled crash) in order to prevent data corruption after experiencing an error. A similar distinction is made between "failing well" and "
failing badly
Failing badly and failing well are concepts in systems security and network security (and engineering in general) describing how a system reacts to failure. The terms have been popularized by Bruce Schneier, a cryptographer and security consultant ...
".
A system that is designed to experience graceful degradation, or to fail soft (used in computing, similar to "fail safe") operates at a reduced level of performance after some component failures. For example, a building may operate lighting at reduced levels and elevators at reduced speeds if grid power fails, rather than either trapping people in the dark completely or continuing to operate at full power. In computing an example of graceful degradation is that if insufficient network bandwidth is available to stream an online video, a lower-resolution version might be streamed in place of the high-resolution version.
Progressive enhancement
Progressive enhancement is a strategy in web design that puts emphasis on web content first, allowing everyone to access the basic content and functionality of a web page, whilst users with additional browser features or faster Internet access r ...
is an example in computing, where web pages are available in a basic functional format for older, small-screen, or limited-capability web browsers, but in an enhanced version for browsers capable of handling additional technologies or that have a larger display available.
In
fault-tolerant computer system
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the ...
s, programs that are considered
robust
Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...
are designed to continue operation despite an error, exception, or invalid input, instead of crashing completely.
Software brittleness
In computer programming and software engineering, software brittleness is the increased difficulty in fixing older software that may appear reliable, but actually fails badly when presented with unusual data or altered in a seemingly minor way. The ...
is the opposite of robustness.
Resilient networks continue to transmit data despite the failure of some links or nodes;
resilient buildings and infrastructure are likewise expected to prevent complete failure in situations like earthquakes, floods, or collisions.
A system with high
failure transparency
In a distributed system, failure transparency refers to the extent to which errors and subsequent recoveries of hosts and services within the system are invisible to users and applications.
For example, if a server fails, but users are automatical ...
will alert users that a component failure has occurred, even if it continues to operate with full performance, so that failure can be repaired or imminent complete failure anticipated. Likewise, a
fail-fast
In systems design, a fail-fast system is one which immediately reports at its interface any condition that is likely to indicate a failure. Fail-fast systems are usually designed to stop normal operation rather than attempt to continue a possibly f ...
component is designed to report at the first point of failure, rather than allow downstream components to fail and generate reports then. This allows easier diagnosis of the underlying problem, and may prevent improper operation in a broken state.
A single fault condition is a situation where one means for
protection
Protection is any measure taken to guard a thing against damage caused by outside forces. Protection can be provided to physical objects, including organisms, to systems, and to intangible things like civil and political rights. Although th ...
against a
hazard
A hazard is a potential source of harm
Harm is a moral and legal concept.
Bernard Gert construes harm as any of the following:
* pain
* death
* disability
* mortality
* loss of abil ity or freedom
* loss of pleasure.
Joel Feinberg giv ...
is defective. If a single fault condition results unavoidably in another single fault condition, the two failures are considered as one single fault condition. A source offers the following example:
Criteria
Providing fault-tolerant design for every component is normally not an option. Associated redundancy brings a number of penalties: increase in weight, size, power consumption, cost, as well as time to design, verify, and test. Therefore, a number of choices have to be examined to determine which components should be fault tolerant:
* How critical is the component? In a car, the radio is not critical, so this component has less need for fault tolerance.
* How likely is the component to fail? Some components, like the drive shaft in a car, are not likely to fail, so no fault tolerance is needed.
* How expensive is it to make the component fault tolerant? Requiring a redundant car engine, for example, would likely be too expensive both economically and in terms of weight and space, to be considered.
An example of a component that passes all the tests is a car's occupant restraint system. While we do not normally think of the ''primary'' occupant restraint system, it is
gravity
In physics, gravity () is a fundamental interaction which causes mutual attraction between all things with mass or energy. Gravity is, by far, the weakest of the four fundamental interactions, approximately 1038 times weaker than the stro ...
. If the vehicle rolls over or undergoes severe g-forces, then this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so we pass the first test. Accidents causing occupant ejection were quite common before
seat belt
A seat belt (also known as a safety belt, or spelled seatbelt) is a vehicle safety device designed to secure the driver or a passenger of a vehicle against harmful movement that may result during a collision or a sudden stop. A seat belt reduc ...
s, so we pass the second test. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so we pass the third test. Therefore, adding seat belts to all vehicles is an excellent idea. Other "supplemental restraint systems", such as
airbag
An airbag is a vehicle occupant-restraint system using a bag designed to inflate extremely quickly, then quickly deflate during a collision. It consists of the airbag cushion, a flexible fabric bag, an inflation module, and an impact sensor. Th ...
s, are more expensive and so pass that test by a smaller margin.
Another excellent and long-term example of this principle being put into practice is the braking system: whilst the actual brake mechanisms are critical, they are not particularly prone to sudden (rather than progressive) failure, and are in any case necessarily duplicated to allow even and balanced application of brake force to all wheels. It would also be prohibitively costly to further double-up the main components and they would add considerable weight. However, the similarly critical systems for actuating the brakes under driver control are inherently less robust, generally using a cable (can rust, stretch, jam, snap) or hydraulic fluid (can leak, boil and develop bubbles, absorb water and thus lose effectiveness). Thus in most modern cars the footbrake hydraulic brake circuit is diagonally divided to give two smaller points of failure, the loss of either only reducing brake power by 50% and not causing as much dangerous brakeforce imbalance as a straight front-back or left-right split, and should the hydraulic circuit fail completely (a relatively very rare occurrence), there is a failsafe in the form of the cable-actuated parking brake that operates the otherwise relatively weak rear brakes, but can still bring the vehicle to a safe halt in conjunction with transmission/engine braking so long as the demands on it are in line with normal traffic flow. The cumulatively unlikely combination of total foot brake failure with the need for harsh braking in an emergency will likely result in a collision, but still one at lower speed than would otherwise have been the case.
In comparison with the foot pedal activated service brake, the parking brake itself is a less critical item, and unless it is being used as a one-time backup for the footbrake, will not cause immediate danger if it is found to be nonfunctional at the moment of application. Therefore, no redundancy is built into it per se (and it typically uses a cheaper, lighter, but less hardwearing cable actuation system), and it can suffice, if this happens on a hill, to use the footbrake to momentarily hold the vehicle still, before driving off to find a flat piece of road on which to stop. Alternatively, on shallow gradients, the transmission can be shifted into Park, Reverse or First gear, and the transmission lock / engine compression used to hold it stationary, as there is no need for them to include the sophistication to first bring it to a halt.
On motorcycles, a similar level of fail-safety is provided by simpler methods; firstly the front and rear brake systems being entirely separate, regardless of their method of activation (that can be cable, rod or hydraulic), allowing one to fail entirely whilst leaving the other unaffected. Secondly, the rear brake is relatively strong compared to its automotive cousin, even being a powerful disc on sports models, even though the usual intent is for the front system to provide the vast majority of braking force; as the overall vehicle weight is more central, the rear tyre is generally larger and grippier, and the rider can lean back to put more weight on it, therefore allowing more brake force to be applied before the wheel locks up. On cheaper, slower utility-class machines, even if the front wheel should use a hydraulic disc for extra brake force and easier packaging, the rear will usually be a primitive, somewhat inefficient, but exceptionally robust rod-actuated drum, thanks to the ease of connecting the footpedal to the wheel in this way and, more importantly, the near impossibility of catastrophic failure even if the rest of the machine, like a lot of low-priced bikes after their first few years of use, is on the point of collapse from neglected maintenance.
Requirements
The basic characteristics of fault tolerance require:
# No
single point of failure
A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software appl ...
– If a system experiences a failure, it must continue to operate without interruption during the repair process.
#
Fault isolation
Fault detection, isolation, and recovery (FDIR) is a subfield of control engineering which concerns itself with monitoring a system, identifying when a fault has occurred, and pinpointing the type of fault and its location. Two approaches can be ...
to the failing component – When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The
National Institute of Standards and Technology
The National Institute of Standards and Technology (NIST) is an agency of the United States Department of Commerce whose mission is to promote American innovation and industrial competitiveness. NIST's activities are organized into physical sci ...
(NIST) categorizes faults based on locality, cause, duration, and effect.
# Fault containment to prevent propagation of the failure – Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter" that can swamp legitimate communication in a system and cause overall system failure.
Firewall
Firewall may refer to:
* Firewall (computing), a technological barrier designed to prevent unauthorized or unwanted communications between computer networks or hosts
* Firewall (construction), a barrier inside a building, designed to limit the spre ...
s or other mechanisms that isolate a rogue transmitter or failing component to protect the system are required.
# Availability of
reversion modes
In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called
availability
In reliability engineering, the term availability has the following meanings:
* The degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at a ...
and is expressed as a percentage. For example, a
five nines system would statistically provide 99.999% availability.
Fault-tolerant systems are typically based on the concept of redundancy.
Fault tolerance techniques
Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system, the more carefully all possible interactions have to be considered and prepared for. Considering the importance of high-value systems in transport,
public utilities
A public utility company (usually just utility) is an organization that maintains the infrastructure for a public service (often also providing a service using that infrastructure). Public utilities are subject to forms of public control and r ...
and the military, the field of topics that touch on research is very wide: it can include such obvious subjects as
software modeling
A modeling language is any artificial language that can be used to express information or knowledge or systems in a structure that is defined by a consistent set of rules. The rules are used for interpretation of the meaning of components in the ...
and reliability, or
hardware design
Processor design is a subfield of computer engineering and electronics engineering (fabrication) that deals with creating a processor, a key component of computer hardware.
The design process involves choosing an instruction set and a certain exec ...
, to arcane elements such as
stochastic
Stochastic (, ) refers to the property of being well described by a random probability distribution. Although stochasticity and randomness are distinct in that the former refers to a modeling approach and the latter refers to phenomena themselv ...
models,
graph theory
In mathematics, graph theory is the study of ''graphs'', which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of '' vertices'' (also called ''nodes'' or ''points'') which are conne ...
, formal or exclusionary logic,
parallel processing, remote
data transmission
Data transmission and data reception or, more broadly, data communication or digital communications is the transfer and reception of data in the form of a digital bitstream or a digitized analog signal transmitted over a point-to-point o ...
, and more.
Replication
Spare components address the first fundamental characteristic of fault tolerance in three ways:
*
Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in
parallel
Parallel is a geometric term of location which may refer to:
Computing
* Parallel algorithm
* Parallel computing
* Parallel metaheuristic
* Parallel (software), a UNIX utility for running programs in parallel
* Parallel Sysplex, a cluster of IBM ...
, and choosing the correct result on the basis of a
quorum
A quorum is the minimum number of members of a deliberative assembly (a body that uses parliamentary procedure, such as a legislature) necessary to conduct the business of that group. According to ''Robert's Rules of Order Newly Revised'', the ...
;
*
Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (
failover
Failover is switching to a redundant or standby computer server, system, hardware component or network upon the failure or abnormal termination of the previously active application, server, system, hardware component, or network in a computer net ...
);
* Diversity: Providing multiple ''different'' implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
All implementations of
RAID
Raid, RAID or Raids may refer to:
Attack
* Raid (military), a sudden attack behind the enemy's lines without the intention of holding ground
* Corporate raid, a type of hostile takeover in business
* Panty raid, a prankish raid by male college ...
,
redundant array of independent disks
Raid, RAID or Raids may refer to:
Attack
* Raid (military), a sudden attack behind the enemy's lines without the intention of holding ground
* Corporate raid, a type of hostile takeover in business
* Panty raid, a prankish raid by male college ...
, except RAID 0, are examples of a fault-tolerant
storage device that uses
data redundancy In computer main memory, auxiliary storage and computer buses, data redundancy is the existence of data that is additional to the actual data and permits correction of errors in stored or transmitted data. The additional data can simply be a compl ...
.
A
lockstep
In the United States, lockstep marching or simply lockstep is marching in a very close single file in such a way that the leg of each person in the file moves in the same way and at the same time as the corresponding leg of the person immediately ...
fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each
replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed
dual modular redundant
In reliability engineering, dual modular redundancy (DMR) is when components of a system are duplicated, providing redundancy in case one should fail. It is particularly applied to systems where the duplicated components work in parallel, particu ...
(DMR). The voting circuit can then only detect a mismatch and recovery relies on other methods. A machine with three replications of each element is termed
triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result, and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
Lockstep
In the United States, lockstep marching or simply lockstep is marching in a very close single file in such a way that the leg of each person in the file moves in the same way and at the same time as the corresponding leg of the person immediately ...
fault-tolerant machines are most easily made fully
synchronous
Synchronization is the coordination of events to operate a system in unison. For example, the conductor of an orchestra keeps the orchestra synchronized or ''in time''. Systems that operate with all parts in synchrony are said to be synchronou ...
, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.
Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.
One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially.
Failure-oblivious computing
''Failure-oblivious computing'' is a technique that enables
computer programs
A computer program is a sequence or set of instructions in a programming language for a computer to execute. Computer programs are one component of software, which also includes documentation and other intangible components.
A computer program i ...
to continue executing despite
errors. The technique can be applied in different contexts. First, it can handle invalid memory reads by returning a manufactured value to the program, which in turn, makes use of the manufactured value and ignores the former
memory
Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembered, ...
value it tried to access, this is a great contrast to
typical memory checkers, which inform the program of the error or abort the program. Second, it can be applied to exceptions where some catch blocks are written or synthesized to catch unexpected exceptions. Furthermore, it happens that the execution is modified several times in a row, in order to prevent cascading failures.
The approach has performance costs: because the technique rewrites code to insert dynamic checks for address validity, execution time will increase by 80% to 500%.
Recovery shepherding
Recovery shepherding is a lightweight technique to enable software programs to recover from otherwise fatal errors such as null pointer dereference and divide by zero.
Comparing to the failure oblivious computing technique, recovery shepherding works on the compiled program binary directly and does not need to recompile to program.
It uses the
just-in-time binary instrumentation framework
Pin
A pin is a device used for fastening objects or material together.
Pin or PIN may also refer to:
Computers and technology
* Personal identification number (PIN), to access a secured system
** PIN pad, a PIN entry device
* PIN, a former Dutch ...
. It attaches to the application process when an error occurs, repairs the execution,
tracks the repair effects as the execution continues, contains the repair effects within the application process, and detaches from the process after all repair effects are flushed from the process state. It does not interfere with the normal execution of the program and therefore incurs negligible overhead.
[ For 17 of 18 systematically collected real world null-dereference and divide-by-zero errors, a prototype implementation enables the application to continue to execute to provide acceptable output and service to its users on the error-triggering inputs.][
]
Circuit breaker
The circuit breaker design pattern
Circuit breaker is a Design pattern (computer science), design pattern used in software development. It is used to detect failures and encapsulates the logic of preventing a failure from constantly recurring, during maintenance, temporary external ...
is a technique to avoid catastrophic failures in distributed systems.
Redundancy
Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment.
This can consist of backup components that automatically "kick in" if one component fails. For example, large cargo trucks can lose a tire without any major consequences. They have many tires, and no one tire is critical (with the exception of the front tires, which are used to steer, but generally carry less load, each and in total, than the other four to 16, so are less likely to fail).
The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by John von Neumann
John von Neumann (; hu, Neumann János Lajos, ; December 28, 1903 – February 8, 1957) was a Hungarian-American mathematician, physicist, computer scientist, engineer and polymath. He was regarded as having perhaps the widest cove ...
in the 1950s.
Two kinds of redundancy are possible:[Avizienis, A. (1976).]
Fault-Tolerant Systems
, IEEE Transactions on Computers, vol. 25, no. 12, pp. 1304–1312 space redundancy and time redundancy. Space redundancy provides additional components, functions, or data items that are unnecessary for fault-free operation. Space redundancy is further classified into hardware, software and information redundancy, depending on the type of redundant resources added to the system. In time redundancy the computation or data transmission is repeated and the result is compared to a stored copy of the previous result. The current terminology for this kind of testing is referred to as 'In Service Fault Tolerance Testing or ISFTT for short.
Disadvantages
Fault-tolerant design's advantages are obvious, while many of its disadvantages are not:
* Interference with fault detection in the same component. To continue the above passenger vehicle example, with either of the fault-tolerant systems it may not be obvious to the driver when a tire has been punctured. This is usually handled with a separate "automated fault-detection system". In the case of the tire, an air pressure monitor detects the loss of pressure and notifies the driver. The alternative is a "manual fault-detection system", such as manually inspecting all tires at each stop.
* Interference with fault detection in another component. Another variation of this problem is when fault tolerance in one component prevents fault detection in a different component. For example, if component B performs some operation based on the output from component A, then fault tolerance in B can hide a problem with A. If component B is later changed (to a less fault-tolerant design) the system may fail suddenly, making it appear that the new component B is the problem. Only after the system has been carefully scrutinized will it become clear that the root problem is actually with component A.
* Reduction of priority of fault correction. Even if the operator is aware of the fault, having a fault-tolerant system is likely to reduce the importance of repairing the fault. If the faults are not corrected, this will eventually lead to system failure, when the fault-tolerant component fails completely or when all redundant components have also failed.
* Test difficulty. For certain critical fault-tolerant systems, such as a nuclear reactor
A nuclear reactor is a device used to initiate and control a fission nuclear chain reaction or nuclear fusion reactions. Nuclear reactors are used at nuclear power plants for electricity generation and in nuclear marine propulsion. Heat from nu ...
, there is no easy way to verify that the backup components are functional. The most infamous example of this is Chernobyl
Chernobyl ( , ; russian: Чернобыль, ) or Chornobyl ( uk, Чорнобиль, ) is a partially abandoned city in the Chernobyl Exclusion Zone, situated in the Vyshhorod Raion of northern Kyiv Oblast, Ukraine. Chernobyl is about no ...
, where operators tested the emergency backup cooling by disabling primary and secondary cooling. The backup failed, resulting in a core meltdown and massive release of radiation.
* Cost. Both fault-tolerant components and redundant components tend to increase cost. This can be a purely economic cost or can include other measures, such as weight. Manned spaceships, for example, have so many redundant and fault-tolerant components that their weight is increased dramatically over unmanned systems, which don't require the same level of safety.
* Inferior components. A fault-tolerant design may allow for the use of inferior components, which would have otherwise made the system inoperable. While this practice has the potential to mitigate the cost increase, use of multiple inferior components may lower the reliability of the system to a level equal to, or even worse than, a comparable non-fault-tolerant system.
Related terms
There is a difference between fault tolerance and systems that rarely have problems. For instance, the Western Electric
The Western Electric Company was an American electrical engineering and manufacturing company officially founded in 1869. A wholly owned subsidiary of American Telephone & Telegraph for most of its lifespan, it served as the primary equipment ma ...
crossbar
Crossbar may refer to:
Structures
* Latch (hardware), a post barring a door
* Top tube of a bicycle frame
* Crossbar, the horizontal member of various sports goals
* Crossbar, a horizontal member of an electricity pylon
Other
* In electronic ...
systems had failure rates of two hours per forty years, and therefore were highly ''fault resistant''. But when a fault did occur they still stopped operating completely, and therefore were not ''fault tolerant''.
See also
*Byzantine fault tolerance
A Byzantine fault (also Byzantine generals problem, interactive consistency, source congruency, error avalanche, Byzantine agreement problem, and Byzantine failure) is a condition of a computer system, particularly distributed computing systems, ...
*Control reconfiguration Control reconfiguration is an active approach in control theory to achieve fault-tolerant control for dynamic systems. It is used when severe faults, such as actuator or sensor outages, cause a break-up of the control loop, which must be restruct ...
*Damage tolerance
In engineering, damage tolerance is a property of a structure relating to its ability to sustain defects safely until repair can be effected. The approach to engineering design to account for damage tolerance is based on the assumption that flaws ...
*Data redundancy In computer main memory, auxiliary storage and computer buses, data redundancy is the existence of data that is additional to the actual data and permits correction of errors in stored or transmitted data. The additional data can simply be a compl ...
*Defence in depth
Defence in depth (also known as deep defence or elastic defence) is a military strategy that seeks to delay rather than prevent the advance of an attacker, buying time and causing additional casualties by yielding space. Rather than defeating ...
*Ecological resilience
In ecology, resilience is the capacity of an ecosystem to respond to a perturbation or disturbance by resisting damage and recovering quickly. Such perturbations and disturbances can include stochastic events such as fires, flooding, windstorms ...
*Elegant degradation Elegant degradation is a term used in engineering to describe what occurs to machines which are subject to constant, repetitive stress.
Externally, such a machine maintains the same appearance to the user, appearing to function properly. Internally ...
*Error detection and correction
In information theory and coding theory with applications in computer science and telecommunication, error detection and correction (EDAC) or error control are techniques that enable reliable delivery of digital data over unreliable communi ...
*Error-tolerant design
An error-tolerant design (also: human-error-tolerant design) is one that does not unduly penalize user or human errors. It is the human equivalent of fault tolerant design that allows equipment to continue functioning in the presence of hardware ...
(human error
Human error refers to something having been done that was " not intended by the actor; not desired by a set of rules or an external observer; or that led the task or system outside its acceptable limits".Senders, J.W. and Moray, N.P. (1991) Human ...
-tolerant design)
*Fail-safe
In engineering, a fail-safe is a design feature or practice that in the event of a specific type of failure, inherently responds in a way that will cause minimal or no harm to other equipment, to the environment or to people. Unlike inherent safe ...
*Failure semantics In distributed computing, failure semantics is used to describe and classify errors that distributed systems can experience., pp 14–16.
Types of errors
A list of types of errors that can occur:
* An omission error is when one or more responses ...
* Fall back and forward
*Graceful exit
A graceful exit (or graceful handling) is a simple programming idiom wherein a program detects a serious error condition and "exits gracefully" in a controlled manner as a result. Often the program prints a descriptive error message to a terminal o ...
*Intrusion tolerance {{Primary sources, date=December 2013
Intrusion tolerance is a fault-tolerant design approach to defending information systems against malicious attacks. In that sense, it is also a computer security approach. Abandoning the conventional aim of prev ...
*List of system quality attributes
Within systems engineering, quality attributes are realized non-functional requirements used to evaluate the performance of a system. These are sometimes named architecture characteristics, or "ilities" after the suffix many of the words share. ...
*Progressive enhancement
Progressive enhancement is a strategy in web design that puts emphasis on web content first, allowing everyone to access the basic content and functionality of a web page, whilst users with additional browser features or faster Internet access r ...
*Resilience (network)
In computer networking, resilience is the ability to "provide and maintain an acceptable level of service in the face of faults and challenges to normal operation." Threats and challenges for services can range from simple misconfiguration over ...
*Robustness (computer science)
In computer science, robustness is the ability of a computer system to cope with errors during execution1990. IEEE Standard Glossary of Software Engineering Terminology, IEEE Std 610.12-1990 defines robustness as "The degree to which a system o ...
*Rollback (data management)
In database technologies, a rollback is an operation which returns the database to some previous state. Rollbacks are important for database integrity, because they mean that the database can be restored to a clean copy even after erroneous opera ...
*Self-management (computer science) Self-management may refer to:
* Self-care, when one's health is under individual control, deliberate, and self-initiated
* Self-managed economy, based on autonomous self-regulating economic units and a decentralised mechanism of resource allocation ...
*Crash-only software
Crash-only software refers to computer programs that handle failures by simply restarting, without attempting any sophisticated recovery. Correctly written components of crash-only software can microreboot to a known-good state without the help ...
References
{{Authority control
Reliability engineering
Computer systems
Control engineering
Systems engineering
Software quality
RAID