
Safety engineering is an engineering discipline which assures that engineered systems provide acceptable levels of safety. It is strongly related to industrial engineering/systems engineering, and the subset system safety engineering. Safety engineering assures that a life-critical system behaves as needed, even when components fail.
Analysis techniques
Analysis techniques can be split into two categories: qualitative and quantitative methods. Both approaches share the goal of finding causal dependencies between a hazard on system level and failures of individual components. Qualitative approaches focus on the question "What must go wrong, such that a system hazard may occur?", while quantitative methods aim at providing estimations about probabilities, rates and/or severity of consequences.
Measures that add complexity to technical systems, such as improvements of design and materials, planned inspections, fool-proof design, and backup redundancy, decrease risk and increase cost. The risk can be decreased to ALARA (as low as reasonably achievable) or ALAPA (as low as practically achievable) levels.
Traditionally, safety analysis techniques rely solely on the skill and expertise of the safety engineer. In the last decade, model-based approaches, like STPA (Systems Theoretic Process Analysis), have become prominent. In contrast to traditional methods, model-based techniques try to derive relationships between causes and consequences from some sort of model of the system.
Traditional methods for safety analysis
The two most common fault modeling techniques are called failure mode and effects analysis (FMEA) and fault tree analysis (FTA). These techniques are just ways of finding problems and of making plans to cope with failures, as in probabilistic risk assessment. One of the earliest complete studies using this technique on a commercial nuclear plant was the WASH-1400 study, also known as the Reactor Safety Study or the Rasmussen Report.
Failure modes and effects analysis
Failure Mode and Effects Analysis (FMEA) is a bottom-up, inductive analytical method which may be performed at either the functional or piece-part level. For functional FMEA, failure modes are identified for each function in a system or equipment item, usually with the help of a functional block diagram. For piece-part FMEA, failure modes are identified for each piece-part component (such as a valve, connector, resistor, or diode). The effects of the failure mode are described, and assigned a probability based on the failure rate and failure mode ratio of the function or component. This quantification is difficult for software: a bug either exists or not, and the failure models used for hardware components do not apply. Temperature, age, and manufacturing variability affect a resistor; they do not affect software.
Failure modes with identical effects can be combined and summarized in a Failure Mode Effects Summary. When combined with criticality analysis, FMEA is known as Failure Mode, Effects, and Criticality Analysis or FMECA.
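The quantification step described above can be sketched in a few lines. This is a minimal illustration, not a worksheet format from any standard: the rows, component names, rates, and ratios are all hypothetical, and the summary assumes that rates of modes with the same effect can simply be added (reasonable only when the modes are rare and independent).

```python
from collections import defaultdict

# Hypothetical worksheet rows: (component, failure mode, effect,
# component failure rate per hour, failure mode ratio).
# All values are illustrative assumptions, not handbook data.
ROWS = [
    ("valve V1", "stuck closed", "loss of flow", 2.0e-6, 0.6),
    ("valve V1", "stuck open", "uncontrolled flow", 2.0e-6, 0.4),
    ("pump P1", "fails to run", "loss of flow", 5.0e-6, 1.0),
]


def mode_rate(failure_rate, mode_ratio):
    """Rate of one failure mode: component failure rate times mode ratio."""
    return failure_rate * mode_ratio


def effects_summary(rows):
    """Combine failure modes with identical effects by summing their rates."""
    summary = defaultdict(float)
    for _component, _mode, effect, rate, ratio in rows:
        summary[effect] += mode_rate(rate, ratio)
    return dict(summary)


summary = effects_summary(ROWS)
for effect, rate in sorted(summary.items()):
    print(f"{effect}: {rate:.2e} per hour")
```

Grouping by effect rather than by component mirrors the idea of a Failure Mode Effects Summary: the reader sees one line per consequence, regardless of how many parts can cause it.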
Fault tree analysis
Fault tree analysis (FTA) is a top-down, deductive analytical method. In FTA, initiating primary events such as component failures, human errors, and external events are traced through Boolean logic gates to an undesired top event such as an aircraft crash or nuclear reactor core melt. The intent is to identify ways to make top events less probable, and verify that safety goals have been achieved.

Fault trees are a logical inverse of success trees, and may be obtained by applying de Morgan's theorem to success trees (which are directly related to reliability block diagrams).
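The relationship between success trees and fault trees can be checked by brute force on a small case. The sketch below assumes a hypothetical system that works when component A works and at least one of B or C works; applying de Morgan's theorem swaps AND and OR and negates the inputs, giving the fault tree. The function names are illustrative only.

```python
from itertools import product


def success(a_ok, b_ok, c_ok):
    """Hypothetical success tree: A works AND (B works OR C works)."""
    return a_ok and (b_ok or c_ok)


def fault(a_failed, b_failed, c_failed):
    """Fault tree from de Morgan: A fails OR (B fails AND C fails)."""
    return a_failed or (b_failed and c_failed)


def trees_agree():
    """Check every input combination: the fault tree is NOT(success tree)."""
    return all(
        fault(not a, not b, not c) == (not success(a, b, c))
        for a, b, c in product([True, False], repeat=3)
    )


print(trees_agree())
```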
FTA may be qualitative or quantitative. When failure and event probabilities are unknown, qualitative fault trees may be analyzed for minimal cut sets. For example, if any minimal cut set contains a single base event, then the top event may be caused by a single failure. Quantitative FTA is used to compute top event probability, and usually requires computer software such as CAFTA from the Electric Power Research Institute or SAPHIRE from the Idaho National Laboratory.
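For a very small tree, both kinds of analysis can be sketched directly. The example below is not how CAFTA or SAPHIRE work internally; it is a minimal illustration under stated assumptions: the tree, the event names, and the probabilities are hypothetical, basic events are assumed independent, and the probability is computed by enumerating all event states, which only scales to a handful of events.

```python
from itertools import product

# Hypothetical fault tree: TOP = (PUMP_A AND PUMP_B) OR POWER.
TREE = ("OR", ("AND", "PUMP_A", "PUMP_B"), "POWER")
# Illustrative basic event probabilities, assumed independent.
PROB = {"PUMP_A": 1e-2, "PUMP_B": 1e-2, "POWER": 1e-4}


def cut_sets(node):
    """Return the cut sets of a node as a set of frozensets of basic events."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, *children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any child's cut set causes the gate output.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Drop every cut set that strictly contains another cut set."""
    sets = cut_sets(node)
    return {s for s in sets if not any(other < s for other in sets)}


def top_event_probability(mcs, prob):
    """Exact top event probability by enumerating all basic event states."""
    events = sorted(set().union(*mcs))
    total = 0.0
    for states in product([False, True], repeat=len(events)):
        failed = {e for e, s in zip(events, states) if s}
        if any(cs <= failed for cs in mcs):
            weight = 1.0
            for e, s in zip(events, states):
                weight *= prob[e] if s else 1.0 - prob[e]
            total += weight
    return total


mcs = minimal_cut_sets(TREE)
single_points = [cs for cs in mcs if len(cs) == 1]
p_top = top_event_probability(mcs, PROB)
print(sorted(sorted(cs) for cs in mcs), single_points, p_top)
```

In this hypothetical tree the single-event cut set containing POWER flags a single point of failure, which is exactly the qualitative finding described above.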
Some industries use both fault trees and event trees. An event tree starts from an undesired initiator (loss of critical supply, component failure etc.) and follows possible further system events through to a series of final consequences. As each new event is considered, a new node on the tree is added with a split of probabilities of taking either branch. The probabilities of a range of "top events" arising from the initial event can then be seen.
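The branching described above amounts to multiplying the initiator frequency by the branch probability at each node. The following sketch assumes a hypothetical initiator and two hypothetical safety functions with illustrative, independent failure-on-demand probabilities; none of the numbers come from a real study.

```python
from itertools import product

# Hypothetical initiating event frequency (per year) and safety functions
# with illustrative failure-on-demand probabilities, assumed independent.
INITIATOR_FREQUENCY = 1e-2
BRANCHES = [("alarm", 1e-2), ("shutdown", 1e-3)]


def event_tree(initiator_frequency, branches):
    """Map every works/fails path through the tree to its frequency."""
    outcomes = {}
    for path in product([True, False], repeat=len(branches)):
        frequency = initiator_frequency
        for works, (_name, p_fail) in zip(path, branches):
            frequency *= (1.0 - p_fail) if works else p_fail
        label = tuple(
            f"{name} {'works' if works else 'fails'}"
            for works, (name, _p_fail) in zip(path, branches)
        )
        outcomes[label] = frequency
    return outcomes


outcomes = event_tree(INITIATOR_FREQUENCY, BRANCHES)
for label, frequency in outcomes.items():
    print(", ".join(label), f"{frequency:.3e} per year")
```

Because every node splits into exactly two branches whose probabilities sum to one, the path frequencies always add back up to the initiator frequency, which is a convenient sanity check.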
Oil and gas industry offshore (API 14C; ISO 10418)
The offshore oil and gas industry uses a qualitative safety systems analysis technique to ensure the protection of offshore production systems and platforms. The analysis is used during the design phase to identify process engineering hazards together with risk mitigation measures. The methodology is described in the American Petroleum Institute Recommended Practice 14C ''Analysis, Design, Installation, and Testing of Basic Surface Safety Systems for Offshore Production Platforms.''
The technique uses system analysis methods to determine the safety requirements to protect any individual process component, e.g. a vessel, pipeline, or pump. [API RP 14C p.1] The safety requirements of individual components are integrated into a complete platform safety system, including liquid containment and emergency support systems such as fire and gas detection.
The first stage of the analysis identifies individual process components; these can include flowlines, headers, pressure vessels, atmospheric vessels, fired heaters, exhaust heated components, pumps, compressors, pipelines and heat exchangers. [API RP 14C p.vi] Each component is subject to a safety analysis to identify undesirable events (equipment failure, process upsets, etc.) for which protection must be provided. [API RP 14C p.15-16] The analysis also identifies a detectable condition (e.g. high pressure) which is used to initiate actions to prevent or minimize the effect of undesirable events. A Safety Analysis Table (SAT) for pressure vessels includes the following details. [API RP 14C p.28]
Other undesirable events for a pressure vessel are under-pressure, gas blowby, leak, and excess temperature together with their associated causes and detectable conditions.

Once the events, causes and detectable conditions have been identified, the next stage of the methodology uses a Safety Analysis Checklist (SAC) for each component. This lists the safety devices that may be required or factors that negate the need for such a device. For example, for the case of liquid overflow from a vessel (as above) the SAC identifies:
* A4.2d - High level sensor (LSH)
** 1. LSH installed.
** 2. Equipment downstream of gas outlet is not a flare or vent system and can safely handle maximum liquid carry-over.
** 3. Vessel function does not require handling of separate fluid phases.
** 4. Vessel is a small trap from which liquids are manually drained.

The analysis ensures that two levels of protection are provided to mitigate each undesirable event. For example, for a pressure vessel subjected to over-pressure the primary protection would be a PSH (pressure switch high) to shut off inflow to the vessel; secondary protection would be provided by a pressure safety valve (PSV) on the vessel.
The next stage of the analysis relates all the sensing devices, shutdown valves (ESVs), trip systems and emergency support systems in the form of a Safety Analysis Function Evaluation (SAFE) chart.
X denotes that the detection device on the left (e.g. PSH) initiates the shutdown or warning action on the top right (e.g. ESV closure).
The SAFE chart constitutes the basis of Cause and Effect Charts, which relate the sensing devices to shutdown valves and plant trips and so define the functional architecture of the process shutdown system.
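A cause and effect relationship of this kind can be represented as a simple mapping from detection devices to the actions they initiate. The sketch below is a hypothetical illustration, not a reproduction of any chart in API RP 14C: the device tags PSH, LSH and PSV follow the text, but the action names, the protection assignments, and the helper functions are assumptions made for the example.

```python
# Hypothetical SAFE chart: each detection device (row) maps to the actions
# (columns) it initiates; membership plays the role of an "X" in the chart.
SAFE_CHART = {
    "PSH": {"close inlet ESV", "alarm"},
    "LSH": {"close inlet ESV"},
}

# Hypothetical protection assignments per undesirable event,
# listed as primary then secondary device.
PROTECTION = {
    "over-pressure": ["PSH", "PSV"],
    "liquid overflow": ["LSH"],
}


def initiates(device, action):
    """True if the chart has an X for this device and action."""
    return action in SAFE_CHART.get(device, set())


def under_protected(protection, required_levels=2):
    """List events with fewer distinct protection devices than required."""
    return sorted(
        event
        for event, devices in protection.items()
        if len(set(devices)) < required_levels
    )


print(initiates("PSH", "close inlet ESV"))
print(under_protected(PROTECTION))
```

An event flagged by such a check is not necessarily a design error; as the checklist above shows, a documented factor can negate the need for a device, so the output is a prompt for review rather than a verdict.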
The methodology also specifies the systems testing that is necessary to ensure the functionality of the protection systems.
API RP 14C was first published in June 1974. The 8th edition was published in February 2017. API RP 14C was adapted as ISO standard ISO 10418 in 1993 entitled ''Petroleum and natural gas industries — Offshore production installations — Analysis, design, installation and testing of basic surface process safety systems.'' The latest edition of ISO 10418 was published in 2019.
Safety certification
Typically, safety guidelines prescribe a set of steps, deliverable documents, and exit criteria focused around planning, analysis and design, implementation, verification and validation, configuration management, and quality assurance activities for the development of a safety-critical system. In addition, they typically formulate expectations regarding the creation and use of traceability in the project. For example, depending upon the criticality level of a requirement, the US Federal Aviation Administration guideline DO-178B/C requires traceability from requirements to design, and from requirements to source code and executable object code for software components of a system. Thereby, higher quality traceability information can simplify the certification process and help to establish trust in the maturity of the applied development process.
Usually a failure in safety-certified systems is acceptable if, on average, fewer than one life per 10^9 hours of continuous operation is lost to failure. Most Western nuclear reactors, medical equipment, and commercial aircraft are certified to this level. The cost versus loss of lives has been considered appropriate at this level (by FAA for aircraft systems under Federal Aviation Regulations).
Preventing failure
Once a failure mode is identified, it can usually be mitigated by adding extra or redundant equipment to the system. For example, nuclear reactors contain dangerous radiation, and nuclear reactions can cause so much heat that no substance might contain them. Therefore, reactors have emergency core cooling systems to keep the temperature down, shielding to contain the radiation, and engineered barriers (usually several, nested, surmounted by a containment building) to prevent accidental leakage.
Safety-critical systems are commonly required to permit no single event or component failure to result in a catastrophic failure mode.
Most biological organisms have a certain amount of redundancy: multiple organs, multiple limbs, etc.
For any given failure, a fail-over or redundancy can almost always be designed and incorporated into a system.
There are two categories of techniques to reduce the probability of failure:
* Fault avoidance techniques increase the reliability of individual items (increased design margin, de-rating, etc.).
* Fault tolerance techniques increase the reliability of the system as a whole (redundancies, barriers, etc.).
Safety and reliability
Safety engineering and reliability engineering have much in common, but safety is not reliability. If a medical device fails, it should fail safely; other alternatives will be available to the surgeon. If the engine on a single-engine aircraft fails, there is no backup. Electrical power grids are designed for both safety and reliability; telephone systems are designed for reliability, which becomes a safety issue when emergency (e.g. US 911) calls are placed.
Probabilistic risk assessment has created a close relationship between safety and reliability. Component reliability, generally defined in terms of component failure rate, and external event probability are both used in quantitative safety assessment methods such as FTA. Related probabilistic methods are used to determine system Mean Time Between Failure (MTBF), system availability, or probability of mission success or failure. Reliability analysis has a broader scope than safety analysis, in that non-critical failures are considered. On the other hand, higher failure rates are considered acceptable for non-critical systems.
Safety generally cannot be achieved through component reliability alone. Catastrophic failure probabilities of 10^-9 per hour correspond to the failure rates of very simple components such as resistors or capacitors. A complex system containing hundreds or thousands of components might be able to achieve an MTBF of 10,000 to 100,000 hours, meaning it would fail at 10^-4 or 10^-5 per hour. If a system failure is catastrophic, usually the only practical way to achieve a 10^-9 per hour failure rate is through redundancy.
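The arithmetic behind this argument can be made concrete. The sketch below rests on several simplifying assumptions that are not in the text: components have constant failure rates, the system fails when any one component fails, redundant channels are identical and fully independent (no common cause failures), and each channel is exposed for one hour between checks. The component count and rates are hypothetical.

```python
import math


def series_failure_rate(component_rates):
    """Failure rate of a system that fails when any component fails."""
    return sum(component_rates)


def mtbf(rate_per_hour):
    """Mean time between failures for a constant failure rate."""
    return 1.0 / rate_per_hour


def channel_failure_probability(rate_per_hour, exposure_hours):
    """Probability that one channel fails within the exposure time."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)


def redundant_failure_probability(rate_per_hour, exposure_hours, channels):
    """Probability that all independent, identical channels fail together."""
    return channel_failure_probability(rate_per_hour, exposure_hours) ** channels


# 1,000 hypothetical components at 1e-8 failures per hour each.
system_rate = series_failure_rate([1e-8] * 1000)
print(f"system rate {system_rate:.1e} per hour, MTBF {mtbf(system_rate):.0f} h")

# Two independent channels, each at the system rate, over a one-hour exposure.
print(f"{redundant_failure_probability(system_rate, 1.0, 2):.1e}")
```

Under these assumptions, even very reliable simple parts add up to a system rate far above 10^-9 per hour, while duplicating the channel brings the combined probability below that target. In practice the independence assumption is the weak point, which is why common cause failures receive so much attention.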
When adding equipment is impractical (usually because of expense), then the least expensive form of design is often "inherently fail-safe". That is, change the system design so its failure modes are not catastrophic. Inherent fail-safes are common in medical equipment, traffic and railway signals, communications equipment, and safety equipment.
The typical approach is to arrange the system so that ordinary single failures cause the mechanism to shut down in a safe way (for nuclear power plants, this is termed a passively safe design, although more than ordinary failures are covered). Alternatively, if the system contains a hazard source such as a battery or rotor, then it may be possible to remove the hazard from the system so that its failure modes cannot be catastrophic. The U.S. Department of Defense Standard Practice for System Safety (MIL-STD-882) places the highest priority on elimination of hazards through design selection.
One of the most common fail-safe systems is the overflow tube in baths and kitchen sinks. If the valve sticks open, rather than causing an overflow and damage, the tank spills into an overflow. Another common example is that in an elevator the cable supporting the car keeps spring-loaded brakes open. If the cable breaks, the brakes grab rails, and the elevator cabin does not fall.
Some systems can never be made fail-safe, as continuous availability is needed. For example, loss of engine thrust in flight is dangerous. Redundancy, fault tolerance, or recovery procedures are used for these situations (e.g. multiple independently controlled and fuel-fed engines). This also makes the system less sensitive to reliability prediction errors or quality-induced uncertainty for the separate items. On the other hand, failure detection and correction and the avoidance of common cause failures become increasingly important here to ensure system-level reliability.
Associations
* International System Safety Society
External links
U.S. Army Pamphlet 385-16 System Safety Management Guide