Software Fault Tolerance

Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures.


Introduction

The only thing constant is change. This is certainly more true of software systems than of almost any other phenomenon. Not all software changes in the same way, so software fault tolerance methods are designed to overcome execution errors by modifying variable values to create an acceptable program state (Ray Giguette and Johnette Hassell, "Toward A Resourceful Method of Software Fault Tolerance", ACM Southeast Regional Conference, April 1999). The need to control software faults is one of the most pressing challenges facing the software industry today. Fault tolerance must be a key consideration in the early stages of software development.

There exist different mechanisms for software fault tolerance, among which:
* Recovery blocks
* N-version software
* Self-checking software
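As an illustration of the first mechanism, a recovery block runs a primary routine, checks its result with an acceptance test, and falls back to an alternate routine if the test fails. The following is only a minimal sketch under stated assumptions: the function names (`recovery_block`, `primary_sort`, `fallback_sort`) and the sorting task are hypothetical and are not taken from any particular system.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when every alternate fails the acceptance test."""


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate on a fresh copy of state until one passes the test."""
    for alternate in alternates:
        # Work on a copy so that a failed attempt leaves the original untouched.
        checkpoint = copy.deepcopy(state)
        try:
            result = alternate(checkpoint)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(result):
            return result
    raise RecoveryBlockError("no alternate passed the acceptance test")


def primary_sort(values):
    """Hypothetical primary routine with a deliberate bug."""
    values.sort()
    values.pop()  # bug: silently drops the largest element
    return values


def fallback_sort(values):
    """Hypothetical simpler alternate."""
    return sorted(values)


def is_sorted_same_length(original):
    """Build an acceptance test: output is ordered and no element was lost."""
    def test(result):
        ordered = all(a <= b for a, b in zip(result, result[1:]))
        return ordered and len(result) == len(original)
    return test


data = [3, 1, 2]
result = recovery_block(data, [primary_sort, fallback_sort],
                        is_sorted_same_length(data))
print(result)
```

Here the primary routine fails the acceptance test because it loses an element, so the result comes from the fallback, and the original `data` list is left unmodified.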


Operating system failure

Computer applications make calls through the application programming interface (API) to access shared resources such as the keyboard, mouse, screen, disk drive, network, and printer. These calls can fail in two ways:
* Blocked calls
* Faults


Blocked calls

A blocked call is a request for services from the operating system that halts the computer program until results are available. For example, a TCP call blocks until a response becomes available from a remote server. This occurs every time you perform an action with a web browser. Intensive calculations cause lengthy delays with the same effect as a blocked API call.

There are two methods used to handle blocking:
* Threads
* Timers

Threading allows a separate sequence of execution for each API call that can block. This can prevent the overall application from stalling while waiting for a resource. It has the benefit that none of the information about the state of the API call is lost while other activities take place.

Timers allow a blocked call to be interrupted. A periodic timer allows the programmer to emulate threading. Interrupts typically destroy any information related to the state of a blocked API call or intensive calculation, so the programmer must keep track of this information separately.

Using timers can lead to corrupted state. This is avoided with the following:
* Track software state
* Semaphore
* Blocking
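The threading approach to blocked calls can be sketched as follows. This is an illustrative example only: `slow_fetch` is a hypothetical stand-in for a blocking API call, and `call_with_deadline` is a helper name chosen for this sketch. It uses Python's standard `concurrent.futures` module to run the call on a worker thread so that the caller waits no longer than a chosen deadline.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def slow_fetch(delay_seconds):
    """Hypothetical stand-in for a blocking API call such as a network read."""
    time.sleep(delay_seconds)
    return "response"


def call_with_deadline(executor, func, arg, deadline_seconds):
    """Run func(arg) on a worker thread and wait at most deadline_seconds."""
    future = executor.submit(func, arg)
    try:
        return True, future.result(timeout=deadline_seconds)
    except TimeoutError:
        # The worker thread keeps running; only the wait is abandoned.
        return False, None


with ThreadPoolExecutor(max_workers=2) as executor:
    fast = call_with_deadline(executor, slow_fetch, 0.01, 1.0)
    slow = call_with_deadline(executor, slow_fetch, 0.5, 0.05)

print(fast)
print(slow)
```

The state of the slow call is not lost when the deadline passes. The worker thread still holds it, which is the benefit of threading described above.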


Faults

Faults are induced by signals in POSIX-compliant systems, and these signals originate from API calls, from the operating system, and from other applications. A signal that has no handler code typically becomes a fault that causes premature application termination. The handler is a function that is performed on demand when the application receives a signal. This is called exception handling.

The kill signal (SIGKILL) and the stop signal (SIGSTOP) are the only signals that cannot be handled. All other signals can be directed to a handler function.

Handler functions come in two broad varieties:
* Initialized
* In-line

Initialized handler functions are paired with each signal when the software starts. This causes the handler function to start up when the corresponding signal arrives. This technique can be used with timers to emulate threading.

In-line handler functions are associated with a call using specialized syntax. The most familiar is the try and catch syntax used with C++ and Java.
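The two varieties of handler can be sketched in Python, whose `try`/`except` plays the role that try and catch play in C++ and Java. This is a minimal sketch that assumes a POSIX system (it uses `SIGUSR1`), Python 3.8 or later (for `signal.raise_signal`), and execution in the main thread. The names `on_signal`, `received`, and `read_config` are hypothetical.

```python
import signal

received = []  # signal numbers seen by the initialized handler


def on_signal(signum, frame):
    """Initialized handler: record the signal instead of letting it end the program."""
    received.append(signum)


# Pair the handler with the signal once, when the program starts.
signal.signal(signal.SIGUSR1, on_signal)

# Simulate the arrival of the signal by sending it to this process.
signal.raise_signal(signal.SIGUSR1)


def read_config(path):
    """In-line handler: the except clause is attached to this one call site."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        # Assumed policy for this sketch: fall back to an empty configuration.
        return ""


print(received == [signal.SIGUSR1])
print(repr(read_config("/nonexistent/path/for/demo")))
```

Without the call to `signal.signal`, the default action for `SIGUSR1` would terminate the process, which is the premature termination described above.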


Hardware failure

Hardware fault tolerance for software requires the following:
* Backup
* Redundancy

Backup maintains information in the event that hardware must be replaced. This can be done in one of two ways:
* Automatic scheduled backup using software
* Manual backup on a regular schedule

Backup requires an information-restore strategy to make backup information available on a replacement system. The restore process is usually time-consuming, and information will be unavailable until the restore process is complete.

Redundancy relies on replicating information on more than one computing device so that the recovery delay is brief. This can be achieved using continuous backup to a live system that remains inactive until needed (synchronized backup). This can also be achieved by replicating information as it is created on multiple identical systems, which can eliminate recovery delay.
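The idea of replicating information as it is created can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not a real replication protocol: the `ReplicatedStore` class and its methods are hypothetical, and each dictionary merely stands in for a separate machine.

```python
class ReplicatedStore:
    """Hypothetical key-value store that writes every update to all replicas."""

    def __init__(self, replica_count):
        # Each dict stands in for a separate machine; None marks a failed one.
        self.replicas = [{} for _ in range(replica_count)]

    def put(self, key, value):
        """Write the value to every surviving replica at creation time."""
        for replica in self.replicas:
            if replica is not None:
                replica[key] = value

    def fail(self, index):
        """Simulate the loss of one machine."""
        self.replicas[index] = None

    def get(self, key):
        """Read from the first surviving replica; no restore step is needed."""
        for replica in self.replicas:
            if replica is not None:
                return replica[key]
        raise RuntimeError("all replicas have failed")


store = ReplicatedStore(replica_count=3)
store.put("invoice-17", "paid")
store.fail(0)
print(store.get("invoice-17"))
```

Because every write reaches all replicas, a read after the simulated failure is served immediately by a surviving copy, in contrast to a backup that must first be restored.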


See also

* Built-in self-test
* Built-in test equipment
* Fault-tolerant design
* Fault-tolerant system
* Fault-tolerant computer system
* Immunity-aware programming
* Logic built-in self-test
* N-version programming
* Safety engineering
* OpenSAF - Service Availability API



Further reading


Software fault tolerance, by Chris Inacio at Carnegie Mellon University (1998)