Self-stabilization
Self-stabilization is a concept of fault-tolerance in distributed systems. Given any initial state, a self-stabilizing distributed system will end up in a correct state in a finite number of execution steps.

At first glance, the guarantee of self-stabilization may seem less promising than that of the more traditional fault tolerance of algorithms, which aim to guarantee that the system always remains in a correct state under certain kinds of state transitions. However, that traditional fault tolerance cannot always be achieved. For example, it cannot be achieved when the system is started in an incorrect state or is corrupted by an intruder. Moreover, because of their complexity, it is very hard to debug and to analyze distributed systems. Hence, it is very hard to prevent a distributed system from reaching an incorrect state. Indeed, some forms of self-stabilization are incorporated into many modern computer and telecommunications networks, since they give these networks the ability to cope with faults that were not foreseen in the design of the algorithm.

Many years after Edsger Dijkstra's seminal 1974 paper, this concept remains important, as it presents an important foundation for self-managing computer systems and fault-tolerant systems. As a result, Dijkstra's paper received the 2002 ACM PODC Influential-Paper Award, one of the highest recognitions in the distributed computing community. Moreover, after Dijkstra's death, the award was renamed and is now called the Dijkstra Prize.


History

In 1974, E.W. Dijkstra presented the concept of self-stabilization, prompting further research in this area. His demonstration involved the presentation of self-stabilizing mutual exclusion algorithms. It also showed the first self-stabilizing algorithms that did not rely on strong assumptions about the system. Some previous protocols used in practice did actually stabilize, but only assuming the existence of a clock that was global to the system, and assuming a known upper bound on the duration of each system transition. It was only ten years later, when Leslie Lamport pointed out the importance of Dijkstra's work at the 1983 Symposium on Principles of Distributed Computing, that researchers directed their attention to this elegant fault-tolerance concept. In his talk, Lamport stated:
I regard this as Dijkstra's most brilliant work - at least, his most brilliant published paper. It's almost completely unknown. I regard it to be a milestone in work on fault tolerance... I regard self-stabilization to be a very important concept in fault tolerance and to be a very fertile field for research.
Afterwards, Dijkstra's work received the ACM PODC Influential-Paper Award, which was later renamed the Dijkstra Prize in Distributed Computing, given by the ACM (the Association for Computing Machinery) at the annual PODC symposium.


Overview

A distributed algorithm is self-stabilizing if, starting from an arbitrary state, it is guaranteed to converge to a legitimate state and remain in a legitimate set of states thereafter. A state is legitimate if, starting from this state, the algorithm satisfies its specification. The property of self-stabilization enables a distributed algorithm to recover from a transient fault regardless of its nature. Moreover, a self-stabilizing algorithm does not have to be initialized, as it eventually starts to behave correctly regardless of its initial state.

Dijkstra's paper, which introduces the concept of self-stabilization, presents an example in the context of a "token ring": a network of computers ordered in a circle. Here, each computer or processor can "see" the whole state of the one processor that immediately precedes it, and that state may imply that the processor "has a token" or "does not have a token". One of the requirements is that exactly one of them must "hold a token" at any given time. The second requirement prescribes that each node "passes the token" to the computer/processor succeeding it, so that the token eventually circulates the ring.

* Not holding a token is a correct state for each computer in this network, since the token can be held by another computer. However, if every computer is in the state of "not holding a token", then the network altogether is not in a correct state.
* Similarly, if more than one computer "holds a token", then this is not a correct state for the network, although it cannot be observed to be incorrect by viewing any computer individually.

Since every computer can "observe" only the states of its two neighbors, it is hard for the computers to decide whether the network altogether is in a correct state.
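The behavior of such a ring can be made concrete with a small simulation. The sketch below implements the counter-based rules of Dijkstra's first ("K-state") token-ring algorithm under a central daemon that activates one privileged processor at a time; the names n, K, x, has_token and step, the instance size, and the random scheduler are illustrative assumptions rather than a fixed interface.

```python
import random

# Minimal simulation sketch of Dijkstra's K-state token ring under a central daemon.
# The names and parameters here are illustrative assumptions.
n = 5                # number of processors in the ring
K = n + 1            # counters take values 0..K-1; choosing K > n suffices for stabilization
x = [random.randrange(K) for _ in range(n)]   # arbitrary, possibly illegitimate, start state

def has_token(i):
    # Processor 0 holds a token when its counter equals that of its predecessor
    # (the last processor); every other processor holds one when the counters differ.
    return x[0] == x[n - 1] if i == 0 else x[i] != x[i - 1]

def step(i):
    # Firing the rule "passes the token": processor 0 increments its counter modulo K,
    # every other processor copies its predecessor's counter.
    if i == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[i] = x[i - 1]

# Central daemon: at each move, pick one privileged processor and let it act.
# At least one processor is always privileged, so the choice below never fails.
for _ in range(10 * n * n):
    step(random.choice([i for i in range(n) if has_token(i)]))

# After enough moves the ring is legitimate: exactly one token circulates.
assert sum(has_token(i) for i in range(n)) == 1
```

Because of closure, once exactly one token exists the ring keeps exactly one token, so letting the simulation run longer than necessary does not disturb the legitimate state.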
The first self-stabilizing algorithms did not detect errors explicitly in order to subsequently repair them. Instead, they constantly pushed the system towards a legitimate state. Since traditional methods for detecting an error were often very difficult and time-consuming, such a behavior was considered desirable. (The method described in the paper cited above collects a huge amount of information from the whole network to one place; after that, it attempts to determine whether the collected global state is correct; even that determination alone can be a hard task.)


Efficiency improvements

More recently, researchers have presented newer methods for light-weight error detection for self-stabilizing systems using local checking, and for general tasks. The term ''local'' refers to a part of a computer network: when local detection is used, a computer in a network is not required to communicate with the entire network in order to detect an error; the error can be detected by having each computer communicate only with its nearest neighbors. These local detection methods simplified the task of designing self-stabilizing algorithms considerably, because the error detection mechanism and the recovery mechanism can be designed separately. Newer algorithms based on these detection methods also turned out to be much more efficient. Moreover, these papers suggested rather efficient general transformers that turn non-self-stabilizing algorithms into self-stabilizing ones. The idea is to:
# run the non-self-stabilizing protocol and, at the same time,
# detect faults (during the execution of the given protocol) using the above-mentioned detection methods, then
# apply a (self-stabilizing) "reset" protocol to return the system to some predetermined initial state, and, finally,
# restart the given (non-self-stabilizing) protocol.
The combination of these four parts is self-stabilizing, as long as no fault is triggered during the fault-correction phases (Baruch Awerbuch, Boaz Patt-Shamir, George Varghese, Shlomi Dolev. Self-Stabilization by Local Checking and Global Reset. WDAG 1994: 326-339).
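The structure of such a transformer can be sketched schematically. The skeleton below is only an illustration of the pattern, not a reproduction of any published construction: the names StabilizingWrapper, local_invariant and reset_state are assumptions, and a real system would replace the single assignment in step 3 with a distributed, self-stabilizing reset protocol such as the ones cited above.

```python
# Schematic skeleton of the transformer pattern described above.
# All names are illustrative assumptions, not a published API.

class StabilizingWrapper:
    def __init__(self, protocol, local_invariant, reset_state):
        self.protocol = protocol                # the given non-self-stabilizing protocol
        self.local_invariant = local_invariant  # predicate over a node and its neighbors
        self.reset_state = reset_state          # predetermined initial state used by "reset"

    def run_step(self, node, neighbors):
        # Steps 1-2: run the protocol while performing local checking; the node
        # inspects only its own state and the states of its nearest neighbors.
        if self.local_invariant(node, neighbors):
            self.protocol.step(node, neighbors)
            return
        # Step 3: a fault was detected locally. A real construction invokes a
        # self-stabilizing reset protocol here; it is abstracted as one assignment.
        node.state = self.reset_state(node)
        # Step 4: the given protocol restarts from the reset state on later steps.
```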
Initial self-stabilizing protocols were also presented in the above papers. More efficient reset protocols were presented later (Baruch Awerbuch, Shay Kutten, Yishay Mansour, Boaz Patt-Shamir, George Varghese. Time Optimal Self-Stabilizing Synchronization. ACM STOC 1993: 652-661).

Additional efficiency was introduced with the notion of time-adaptive protocols. The idea behind these is that when only a small number of errors occurs, the recovery time can (and should) be made short. Dijkstra's original self-stabilization algorithms do not have this property.

A useful property of self-stabilizing algorithms is that they can be composed of layers if the layers do not exhibit any circular dependencies.
. The stabilization time of the composition is then bounded by the sum of the individual stabilization times of each layer.
(Shlomi Dolev, Amos Israeli, Shlomo Moran. Self-Stabilization of Dynamic Systems Assuming Only Read/Write Atomicity. Distributed Computing, volume 7, pages 3–16, 1993.)
New approaches to Dijkstra's work emerged later, such as Krzysztof Apt and Ehsan Shoja's proposal, which demonstrated how self-stabilization can be naturally formulated using the standard concepts of strategic games, particularly the concept of an improvement path. This work sought to demonstrate the link between self-stabilization and game theory.


Time complexity

The time complexity of a self-stabilizing algorithm is measured in (asynchronous) rounds or cycles.
* A ''round'' is the shortest execution trace in which each processor executes at least one step.
* Similarly, a ''cycle'' is the shortest execution trace in which each processor executes at least one complete iteration of its repeatedly executed list of commands.
To measure the output stabilization time, a subset of the state variables is defined to be externally visible (the ''output''). Certain states of outputs are defined to be correct (legitimate). The set of the outputs of all the components of the system is said to have stabilized at the time that it starts to be correct, provided it stays correct indefinitely, unless additional faults occur. The output stabilization time is the time (the number of (asynchronous) ''rounds'') until the output stabilizes.
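As an illustration of these measures, the stabilization time of the token-ring sketch from the Overview can be estimated by counting rounds in which every processor is scheduled exactly once; the schedule, the instance size, and the name stabilization_rounds below are assumptions made for the example.

```python
import random

# Estimate output stabilization time, in rounds, for Dijkstra's K-state ring.
# One round schedules every processor once (in order), matching the definition
# of a round as the shortest trace in which each processor executes a step.
def stabilization_rounds(n, trials=200):
    K = n + 1
    worst = 0
    for _ in range(trials):
        x = [random.randrange(K) for _ in range(n)]   # arbitrary start state

        def privileged():
            return [i for i in range(n)
                    if (x[0] == x[n - 1] if i == 0 else x[i] != x[i - 1])]

        rounds = 0
        # The output is legitimate when exactly one privilege exists in the ring.
        while len(privileged()) != 1:
            for i in range(n):            # every processor takes its turn this round
                if i == 0 and x[0] == x[n - 1]:
                    x[0] = (x[0] + 1) % K
                elif i > 0 and x[i] != x[i - 1]:
                    x[i] = x[i - 1]
            rounds += 1
        worst = max(worst, rounds)
    return worst

print(stabilization_rounds(8))   # worst stabilization time observed over the sampled starts
```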


Definition

A system is self-stabilizing if and only if:
# Starting from any state, it is guaranteed that the system will eventually reach a correct state (''convergence'').
# Given that the system is in a correct state, it is guaranteed to stay in a correct state, provided that no fault happens (''closure'').
A small exhaustive check of these two conditions on a toy instance is sketched at the end of this section. A system is said to be ''randomized self-stabilizing'' if and only if it is self-stabilizing and the expected number of rounds needed to reach a correct state is bounded by some constant k.

Designing self-stabilization in the above sense is well known to be a difficult job. In fact, a class of distributed algorithms do not have the property of local checking: the legitimacy of the network state cannot be evaluated by a single process. The most obvious case is Dijkstra's token ring defined above: no process can detect whether the network state is legitimate or not in the case where more than one token is present in non-neighboring processes. This suggests that self-stabilization of a distributed system is a sort of collective intelligence, where each component takes local actions based on its local knowledge, but eventually this guarantees global convergence at the end.

To help overcome the difficulty of designing self-stabilization as defined above, other types of stabilization were devised. For instance, ''weak stabilization'' is the property that a distributed system has a possibility to reach its legitimate behavior from every possible state. Weak stabilization is easier to design, as it only guarantees a ''possibility'' of convergence for some runs of the distributed system rather than convergence for every run.

A self-stabilizing algorithm is ''silent'' if and only if it converges to a global state where the values of communication registers used by the algorithm remain fixed.
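For a toy instance, convergence and closure can even be verified exhaustively. The sketch below checks both conditions for Dijkstra's K-state ring with n = 3 and K = 4 by enumerating all 64 states; it is an illustrative check for this specific small instance (the names privileged, legitimate, successors and acyclic are assumptions), not a general verification method.

```python
from itertools import product

# Exhaustive check of closure and convergence for a toy instance of Dijkstra's
# K-state ring (n = 3, K = 4). Purely illustrative; names are assumptions.
n, K = 3, 4

def privileged(x):
    return [i for i in range(n) if (x[0] == x[n - 1] if i == 0 else x[i] != x[i - 1])]

def legitimate(x):
    return len(privileged(x)) == 1      # exactly one token in the ring

def successors(x):
    # All states reachable by one move of a central daemon.
    out = []
    for i in privileged(x):
        y = list(x)
        y[i] = (y[0] + 1) % K if i == 0 else y[i - 1]
        out.append(tuple(y))
    return out

states = list(product(range(K), repeat=n))

# Closure: every move from a legitimate state leads to a legitimate state.
assert all(all(legitimate(y) for y in successors(x))
           for x in states if legitimate(x))

# Convergence: no execution can avoid legitimate states forever. Since every state
# has at least one successor, this holds exactly when the transition graph restricted
# to illegitimate states contains no cycle, which a depth-first search can confirm.
color = {}   # 1 = on the DFS stack, 2 = finished

def acyclic(x):
    if color.get(x) == 1:
        return False                    # a cycle of illegitimate states was found
    if color.get(x) == 2:
        return True
    color[x] = 1
    ok = all(acyclic(y) for y in successors(x) if not legitimate(y))
    color[x] = 2
    return ok

assert all(acyclic(x) for x in states if not legitimate(x))
print("convergence and closure hold for n = 3, K = 4")
```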


Related work

An extension of the concept of self-stabilization is that of superstabilization. The intent here is to cope with dynamic distributed systems that undergo topological changes. In classical self-stabilization theory, arbitrary changes are viewed as errors where no guarantees are given until the system has stabilized again. With superstabilizing systems, there is a ''passage'' predicate that is always satisfied while the system's topology is reconfigured.


External links


* libcircle - An implementation of self-stabilization using token passing for termination.