
In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing the reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers or multi-threaded computer processing.

In many safety-critical systems, such as fly-by-wire and hydraulic systems in aircraft, some parts of the control system may be triplicated, which is formally termed triple modular redundancy (TMR). An error in one component may then be out-voted by the other two. In a triply redundant system, the system has three subcomponents, all three of which must fail before the system fails. Since each one rarely fails, and the subcomponents are designed to preclude common failure modes (so that their failures can be modelled as independent), the probability of all three failing is calculated to be extraordinarily small; it is often outweighed by other risk factors, such as human error. Electrical surges arising from lightning strikes are an example of a failure mode which is difficult to fully isolate, unless the components are powered from independent power buses and have no direct electrical pathway in their interconnect (communication by some means is required for voting). Redundancy may also be known by the terms "majority voting systems" or "voting logic".

Redundancy sometimes produces less, instead of greater, reliability: it creates a more complex system which is prone to various issues, it may lead to human neglect of duty, and it may lead to higher production demands which, by overstressing the system, may make it less safe.

Redundancy is one form of robustness as practiced in computer science. Geographic redundancy has become important in the data center industry, to safeguard data against natural disasters and political instability (see below).


Forms of redundancy

In computer science, there are four major forms of redundancy:
* Hardware redundancy, such as dual modular redundancy and triple modular redundancy
* Information redundancy, such as error detection and correction methods
* Time redundancy, performing the same operation multiple times, such as multiple executions of a program or multiple copies of data transmitted
* Software redundancy, such as N-version programming

A modified form of software redundancy, applied to hardware, may be:
* Distinct functional redundancy, such as both mechanical and hydraulic braking in a car. Applied in the case of software, this means code that is written independently and is distinctly different but produces the same results for the same inputs.

Structures are usually designed with redundant parts as well, ensuring that if one part fails, the entire structure will not collapse. A structure without redundancy is called fracture-critical, meaning that a single broken component can cause the collapse of the entire structure. Bridges that failed due to lack of redundancy include the Silver Bridge and the Interstate 5 bridge over the Skagit River.

Parallel and combined systems demonstrate different levels of redundancy. These models are the subject of studies in reliability and safety engineering.
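As a small illustration of information redundancy, the sketch below appends a single even-parity bit to a block of data bits. This is only one of the simplest error-detection schemes and is chosen here for brevity; the function names `add_parity` and `parity_ok` are hypothetical. A single parity bit can detect any odd number of flipped bits but cannot correct them.

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True when the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


data = [1, 0, 1, 1]
word = add_parity(data)      # the redundant bit travels with the data

corrupted = word.copy()
corrupted[2] ^= 1            # simulate a single bit flipped in transit

print(parity_ok(word))       # the intact word passes the check
print(parity_ok(corrupted))  # the single-bit error is detected
```

The extra bit carries no new information about the data; its only purpose is to make corruption visible, which is the essence of information redundancy.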


Dissimilar redundancy

Unlike traditional redundancy, which uses more than one of the same thing, dissimilar redundancy uses different things. The idea is that the different things are unlikely to contain identical flaws. The voting method may involve additional complexity if the two things take different amounts of time. Dissimilar redundancy is often used with software, because identical software contains identical flaws.

The chance of failure is reduced by using at least two different types of each of the following:
* processors,
* operating systems,
* software,
* sensors,
* types of actuators (electric, hydraulic, pneumatic, manual mechanical, etc.),
* communications protocols,
* communications hardware,
* communications networks,
* communications paths
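The idea of dissimilar software redundancy can be sketched with two independently written implementations of the same function whose results are cross-checked. The example below is an illustrative assumption, not taken from any real system: it compares the standard library square root with a hand-written Newton iteration, and the names `sqrt_newton` and `cross_checked` are hypothetical.

```python
import math


def sqrt_newton(x, iterations=60):
    """Independent square-root implementation using Newton's method."""
    if x < 0:
        raise ValueError("negative input")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def cross_checked(x, impl_a, impl_b, rel_tol=1e-9):
    """Run two dissimilar implementations and accept the result only if they agree."""
    a = impl_a(x)
    b = impl_b(x)
    if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12):
        raise RuntimeError(f"implementations disagree: {a} vs {b}")
    return a


print(cross_checked(2.0, math.sqrt, sqrt_newton))
```

With only two implementations a disagreement can be detected but not resolved; deciding which one is right requires a third, which is where voting logic comes in.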


Geographic redundancy

Geographic redundancy corrects the vulnerabilities of redundant devices by geographically separating backup devices. It reduces the likelihood of events such as power outages, floods, HVAC failures, lightning strikes, tornadoes, building fires, wildfires, and mass shootings disabling most of the system, if not the entirety of it.

Geographic redundancy locations can be
* more than continental, [Data Center Site Redundancy, H. M. Brotherton and J. Eric Dietz, Computer Information Technology, Purdue University]
* more than 62 miles apart and less than apart, [Brotherton and Dietz, Data Center Site Redundancy]
* less than 62 miles apart, but not on the same campus, or
* different buildings that are more than apart on the same campus.

The following methods can reduce the risks of damage by a fire conflagration:
* large buildings at least to apart, but sometimes a minimum of apart [Factory Mutual Insurance Company, 1-20 Protection Against Exterior Fire Exposure; National Research Council, Canada, Division of Building Research, Spatial Separation of Buildings, November 1959]
* high-rise buildings at least apart
* open spaces clear of flammable vegetation within on each side of objects
* different wings on the same building, in rooms that are separated by more than
* different floors on the same wing of a building in rooms that are horizontally offset by a minimum of with fire walls between the rooms that are on different floors
* two rooms separated by another room, leaving at least a 70-foot gap between the two rooms
* there should be a minimum of two separated fire walls and on opposite sides of a corridor

Geographic redundancy is used by Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, Netflix, Dropbox, Salesforce, LinkedIn, PayPal, Twitter, Facebook, Apple iCloud, Cisco Meraki, and many others to provide high availability and fault tolerance for their cloud services.

As another example, to minimize risk of damage from severe windstorms or water damage, buildings can be located at least 2 miles (3.2 km) away from the shore, with an elevation of at least 5 feet (1.5 m) above sea level. For additional protection, they can be located at least 100 feet (30 m) away from flood plain areas.
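A separation rule such as the 62-mile figure mentioned above can be checked from site coordinates. The following is a minimal sketch, assuming a spherical Earth (haversine formula with a mean radius of about 3958.8 miles); the function names and the coordinates are hypothetical and not taken from the cited guidance.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; a spherical approximation


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def farther_than_62_miles(site_a, site_b):
    """True when two (lat, lon) sites are more than 62 miles apart."""
    return miles_between(*site_a, *site_b) > 62


# Hypothetical sites on the equator, one degree of longitude apart (about 69 miles).
primary = (0.0, 0.0)
backup = (0.0, 1.0)
print(round(miles_between(*primary, *backup), 1), farther_than_62_miles(primary, backup))
```

Distance alone does not guarantee independence: two distant sites can still share a power grid, a network carrier, or a flood plain, so a check like this is only a first filter.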


Functions of redundancy

The two functions of redundancy are passive redundancy and active redundancy. Both use extra capacity to prevent performance decline from exceeding specification limits without human intervention.

Passive redundancy uses excess capacity to reduce the impact of component failures. One common form of passive redundancy is the extra strength of cabling and struts used in bridges. This extra strength allows some structural components to fail without bridge collapse. The extra strength used in the design is called the margin of safety.

Eyes and ears provide working examples of passive redundancy. Vision loss in one eye does not cause blindness, but depth perception is impaired. Hearing loss in one ear does not cause deafness, but directionality is lost. Performance decline is commonly associated with passive redundancy when a limited number of failures occur.

Active redundancy eliminates performance declines by monitoring the performance of individual devices, and this monitoring is used in voting logic. The voting logic is linked to switching that automatically reconfigures the components. Error detection and correction and the Global Positioning System (GPS) are two examples of active redundancy.

Electrical power distribution provides another example of active redundancy. Several power lines connect each generation facility with customers. Each power line includes monitors that detect overload. Each power line also includes circuit breakers. The combination of power lines provides excess capacity. Circuit breakers disconnect a power line when monitors detect an overload. Power is redistributed across the remaining lines. At the Toronto Airport, there are 4 redundant electrical lines. Each of the 4 lines supplies enough power for the entire airport. A spot network substation uses reverse current relays to open breakers to lines that fail, but lets power continue to flow to the airport.

Electrical power systems use power scheduling to reconfigure active redundancy. Computing systems adjust the production output of each generating facility when other generating facilities are suddenly lost. This prevents blackout conditions during major events such as an earthquake.
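The power-line behaviour described in this section can be sketched as a toy model. The code below assumes, purely for illustration, that load is shared equally across all closed lines and that a breaker opens whenever a line's share exceeds its capacity; real networks divide load according to impedance, and the name `redistribute` and all figures are hypothetical.

```python
def redistribute(total_load, capacities, failed=()):
    """Share load equally over healthy lines, opening breakers on overloaded ones.

    capacities: dict mapping line name -> maximum load the line can carry.
    failed: names of lines that are already out of service.
    Returns a dict mapping each line still in service to the load it carries.
    """
    active = {name: cap for name, cap in capacities.items() if name not in failed}
    while active:
        share = total_load / len(active)
        overloaded = [name for name, cap in active.items() if share > cap]
        if not overloaded:
            return {name: share for name in active}
        for name in overloaded:
            del active[name]  # the breaker disconnects the overloaded line
    raise RuntimeError("no line left to carry the load (blackout)")


# Four lines, each able to carry the whole load on its own, as in the airport example.
lines = {"A": 100, "B": 100, "C": 100, "D": 100}
print(redistribute(100, lines))
print(redistribute(100, lines, failed={"A", "B", "C"}))
```

When the lines have less spare capacity, the same loop shows how one trip can overload the next line and cascade, which is why the excess capacity matters.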


Disadvantages

Charles Perrow, author of Normal Accidents, has said that sometimes redundancies backfire and produce less, not more, reliability. This may happen in three ways: First, redundant safety devices result in a more complex system, more prone to errors and accidents. Second, redundancy may lead to shirking of responsibility among workers. Third, redundancy may lead to increased production pressures, resulting in a system that operates at higher speeds, but less safely.


Voting logic

Voting logic uses performance monitoring to determine how to reconfigure individual components so that operation continues without violating specification limitations of the overall system. Voting logic often involves computers, but systems composed of items other than computers may be reconfigured using voting logic. Circuit breakers are an example of a form of non-computer voting logic.

The simplest voting logic in computing systems involves two components: primary and alternate. They both run similar software, but the output from the alternate remains inactive during normal operation. The primary monitors itself and periodically sends an activity message to the alternate as long as everything is OK. All outputs from the primary stop, including the activity message, when the primary detects a fault. The alternate activates its output and takes over from the primary after a brief delay when the activity message ceases. Errors in voting logic can cause both outputs to be active or inactive at the same time, or cause outputs to flutter on and off.

A more reliable form of voting logic involves an odd number of devices, three or more. All perform identical functions and the outputs are compared by the voting logic. The voting logic establishes a majority when there is a disagreement, and the majority will act to deactivate the output from other device(s) that disagree. A single fault will not interrupt normal operation. This technique is used with avionics systems, such as those responsible for operation of the Space Shuttle.
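Both schemes described in this section can be sketched in a few lines. The code below is a simplified illustration under stated assumptions: device outputs are directly comparable values, the activity message is modelled only as a timestamp, and the names `majority_vote` and `alternate_should_take_over` are hypothetical rather than taken from any avionics system.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (majority value, devices to deactivate) for an odd number of devices.

    outputs: dict mapping device name -> output value (values must be hashable).
    """
    if len(outputs) < 3 or len(outputs) % 2 == 0:
        raise ValueError("expects an odd number of devices, three or more")
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among device outputs")
    dissenters = sorted(name for name, out in outputs.items() if out != value)
    return value, dissenters


def alternate_should_take_over(last_activity_time, now, timeout):
    """Primary/alternate scheme: take over once activity messages have ceased."""
    return now - last_activity_time > timeout


# One faulty device is out-voted by the other two.
print(majority_vote({"unit1": 42, "unit2": 42, "unit3": 17}))

# The alternate stays passive while messages are recent, then activates.
print(alternate_should_take_over(last_activity_time=10.0, now=10.5, timeout=1.0))
print(alternate_should_take_over(last_activity_time=10.0, now=12.0, timeout=1.0))
```

The timeout corresponds to the "brief delay" in the text: too short and the alternate may activate while the primary is still running, too long and the output stays inactive after a fault.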


Calculating the probability of system failure

Each duplicate component added to the system decreases the probability of system failure according to the formula:

p = \prod_{i=1}^{n} p_i

where:
* n – number of components
* p_i – probability of component i failing
* p – the probability of all components failing (system failure)

This formula assumes independence of failure events. That means that the probability of a component B failing given that a component A has already failed is the same as that of B failing when A has not failed. There are situations where this is unreasonable, such as using two power supplies connected to the same socket in such a way that if one power supply failed, the other would too.

It also assumes that only one component is needed to keep the system running.
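The product formula in this section translates directly into code. The sketch below assumes independent failures, exactly as the formula does; the function name `system_failure_probability` and the example figures are hypothetical.

```python
import math


def system_failure_probability(component_failure_probs):
    """Probability that every redundant component fails, assuming independence."""
    probs = list(component_failure_probs)
    if not probs:
        raise ValueError("at least one component is required")
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return math.prod(probs)


# Three components, each with a hypothetical 10% chance of failing.
print(system_failure_probability([0.1, 0.1, 0.1]))
```

If the failures are correlated, as with two power supplies on the same socket, the true probability can be far higher than this product suggests.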


Redundancy and high availability

You can achieve higher availability through redundancy. Let's say you have three redundant components: A, B and C. You can use the following formula to calculate the availability of the overall system:

Availability of redundant components = 1 - (1 - availability of component A) × (1 - availability of component B) × (1 - availability of component C)

As a corollary, if you have N parallel components each having availability X, then:

Availability of parallel components = 1 - (1 - X)^N

Using redundant components can exponentially increase the availability of the overall system. For example, if each of your hosts has only 50% availability, by using 10 hosts in parallel, you can achieve 99.9023% availability.

Note that redundancy doesn't always lead to higher availability. In fact, redundancy increases complexity, which in turn can reduce availability. According to Marc Brooker, to take advantage of redundancy, ensure that:
# You achieve a net-positive improvement in the overall availability of your system
# Your redundant components fail independently
# Your system can reliably detect healthy redundant components
# Your system can reliably scale out and scale in redundant components.
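The availability formula for parallel components can be checked numerically. The sketch below assumes independent components, as the formula does; `parallel_availability` and `components_needed` are hypothetical helper names, and the second function is a simple derived convenience rather than something stated in the text.

```python
import math


def parallel_availability(availabilities):
    """Availability of components in parallel: 1 minus the product of unavailabilities."""
    return 1 - math.prod(1 - a for a in availabilities)


def components_needed(x, target):
    """Smallest N such that N parallel components of availability x reach the target."""
    if not 0.0 < x <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("require 0 < x <= 1 and 0 <= target < 1")
    n = 1
    while 1 - (1 - x) ** n < target:
        n += 1
    return n


# Ten hosts at 50% availability each, as in the example above.
print(f"{parallel_availability([0.5] * 10):.4%}")
print(components_needed(0.5, 0.999))
```

These figures are an upper bound in practice, since they ignore the added complexity and any shared failure modes that the caveats above warn about.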



External links


Secure Propulsion using Advanced Redundant Control

Using powerline as a redundant communication channel