Protocol
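The failure-detection steps described in this section can be sketched as a single protocol period. This is a minimal illustration rather than a reference implementation: the function `probe_round`, the `can_reach` callback (a stand-in for "a ping was sent and acknowledged in time") and the default `k=3` are hypothetical names and values chosen for the example, and timestamp bookkeeping is omitted.

```python
import random


def probe_round(prober, members, can_reach, k=3, rng=random):
    """Run one protocol period for `prober` and return (target, verdict).

    `can_reach(a, b)` is a hypothetical stand-in for "a pinged b and got an
    acknowledgement back before the timeout".
    """
    others = [m for m in members if m != prober]
    target = rng.choice(others)

    # Direct ping: a response means the target is healthy.
    if can_reach(prober, target):
        return target, "alive"

    # Indirect probe: ask up to k other members to ping the target.
    helpers = [m for m in others if m != target]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        if can_reach(prober, helper) and can_reach(helper, target):
            return target, "alive"

    # No direct or indirect acknowledgement arrived in this period.
    return target, "failed"


# Usage: node "c" has crashed, so nothing can reach it.
members = ["a", "b", "c", "d", "e"]
crashed = {"c"}
reach = lambda src, dst: src not in crashed and dst not in crashed
print(probe_round("a", members, reach))
```

The indirect probe is what separates a crashed target from a lossy link between the prober and the target: the target is only declared failed when neither the prober nor any of the chosen helpers can reach it.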
The protocol has two components, the ''Failure Detector Component'' and the ''Dissemination Component''.

The ''Failure Detector Component'' functions as follows:
# Every ''T''′ time units, each node (N1) sends a ping to a random other node (N2) in its membership list.
# If N1 receives a response from N2, N2 is considered healthy and N1 updates its "last heard from" timestamp for N2 to the current time.
# If N1 does not receive a response, N1 contacts ''k'' other nodes on its list and requests that they ping N2.
# If no successful response has been received after ''T''′ time units, N1 marks N2 as failed.

The ''Dissemination Component'' functions as follows:
* Upon detecting a failed node N2, N1 sends a multicast message reporting the failure to the other nodes in its membership list.

Properties
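As a worked illustration of the detection-time guarantee in this section, the expected detection time given in the SWIM analysis, T′ · 1 / (1 − e^(−q_f)), can be evaluated directly. Here T′ is the protocol period and q_f the fraction of non-faulty nodes; the function name and the sample values below are assumptions made for the example.

```python
import math


def expected_detection_time(period, nonfaulty_fraction):
    """Expected time from failure to first detection: T' / (1 - e^(-q_f))."""
    if not 0 < nonfaulty_fraction <= 1:
        raise ValueError("nonfaulty_fraction must be in (0, 1]")
    return period / (1 - math.exp(-nonfaulty_fraction))


# Express the result in protocol periods by setting the period to 1.
for q_f in (1.0, 0.9, 0.5):
    periods = expected_detection_time(1.0, q_f)
    print(f"q_f={q_f}: {periods:.2f} protocol periods")
```

With every node healthy (q_f = 1) the expression reduces to e / (e − 1), about 1.58 protocol periods, and it grows as the non-faulty fraction shrinks. The formula does not depend on the group size.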
The protocol provides the following guarantees:
* Strong Completeness: Full completeness is guaranteed (i.e. the crash-failure of any node in the group is eventually detected by all live nodes).
* Detection Time: The expected value of detection time (from node failure to detection) is <math>T' \cdot \frac{1}{1 - e^{-q_f}}</math>, where <math>T'</math> is the length of the protocol period and <math>q_f</math> is the fraction of non-faulty nodes in the group.

Extensions
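Of the extensions covered in this section, round-robin probe target selection is the simplest to sketch in code. The class below is an illustrative assumption, not code from the paper: the name `RoundRobinTargets` is hypothetical, and reshuffling the order after each full pass is a design choice of this sketch.

```python
import random


class RoundRobinTargets:
    """Hand out probe targets so every member is chosen once per pass."""

    def __init__(self, members, rng=random):
        self._rng = rng
        self._order = list(members)
        if not self._order:
            raise ValueError("at least one member is required")
        self._rng.shuffle(self._order)
        self._next = 0

    def next_target(self):
        # After a full pass, reshuffle and start over.
        if self._next == len(self._order):
            self._rng.shuffle(self._order)
            self._next = 0
        target = self._order[self._next]
        self._next += 1
        return target


# Usage: two passes over four peers; each peer appears once per pass.
selector = RoundRobinTargets(["b", "c", "d", "e"])
print([selector.next_target() for _ in range(8)])
```

Because each pass visits every member exactly once, a prober with n peers returns to any given peer within at most 2n − 1 protocol periods. Purely random selection offers no such bound.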
The original SWIM paper lists the following extensions to make the protocol more robust:
* Suspicion: Nodes that are unresponsive to ping messages are not immediately marked as failed. Instead, they are marked as "suspicious"; a node that discovers a "suspicious" node still multicasts this information to all other nodes. If a "suspicious" node responds to a ping before a time-out threshold expires, an "alive" message is sent via multicast to remove the "suspicious" label from the node.
* Infection-Style Dissemination: Instead of propagating node failure information via multicast, protocol messages are piggybacked on the ping messages used to determine node liveness. This is equivalent to gossip dissemination.
* Round-Robin Probe Target Selection: Instead of randomly picking a node to probe during each protocol time step, each node selects its probe targets in round-robin order. This bounds the worst-case detection time of the protocol without degrading the average detection time.

See Also
References