In computer science, a heartbeat is a periodic signal generated by hardware or software to indicate normal operation or to synchronize other parts of a computer system. The heartbeat mechanism is one of the common techniques in mission-critical systems for providing high availability and fault tolerance of network services. It works by detecting network or system failures of nodes or daemons that belong to a network cluster administered by a master server. Detection lets the system adapt and rebalance automatically: the remaining redundant nodes in the cluster take over the load of failed nodes so that service continues without interruption.

Usually a heartbeat message is sent between machines at a regular interval on the order of seconds. If the endpoint does not receive a heartbeat for a time, usually a few heartbeat intervals, the machine that should have sent the heartbeat is assumed to have failed. Heartbeat messages are typically sent continuously on a periodic basis from the originator's start-up until the originator's shutdown. When the destination identifies a lack of heartbeat messages during an anticipated arrival period, the destination may determine that the originator has failed, shut down, or is generally no longer available.
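The timeout rule described above can be illustrated with a minimal Python sketch. The class name `HeartbeatMonitor`, its parameters, and the default of three missed intervals are assumptions chosen for illustration; they do not come from any particular heartbeat implementation.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer and flags silent ones.

    Hypothetical sketch: a peer is presumed failed once no heartbeat has
    arrived for more than `interval * missed_allowed` seconds.
    """

    def __init__(self, interval=1.0, missed_allowed=3, clock=time.monotonic):
        self.timeout = interval * missed_allowed
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # peer id -> time of last heartbeat

    def record(self, peer):
        """Call whenever a heartbeat message arrives from `peer`."""
        self.last_seen[peer] = self.clock()

    def failed_peers(self):
        """Return the peers whose heartbeats have been absent too long."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A receiver would call `record()` for each incoming heartbeat and poll `failed_peers()` periodically. Allowing a few missed intervals before declaring failure trades detection speed for fewer false alarms.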


Heartbeat protocol

A heartbeat protocol is generally used to negotiate and monitor the availability of a resource, such as a floating IP address. The procedure involves sending network packets to all the nodes in the cluster to verify their reachability. Typically, when a heartbeat starts on a machine, it performs an election process with other machines on the heartbeat network to determine which machine, if any, owns the resource. On heartbeat networks of more than two machines, it is important to take partitioning into account, where two halves of the network could be functioning but unable to communicate with each other. In such a situation, it is important that the resource is owned by only one machine, not by one machine in each partition.

As a heartbeat is intended to indicate the health of a machine, it is important that the heartbeat protocol and the transport it runs on are as reliable as possible. Causing a failover because of a false alarm may, depending on the resource, be highly undesirable. It is also important to react quickly to an actual failure, which further underlines the need for reliable heartbeat messages. For this reason, it is often desirable to run a heartbeat over more than one transport; for instance, an Ethernet segment using UDP/IP, and a serial link.

The "cluster membership" of a node is a property of network reachability: if the master can communicate with node x, the node is considered a member of the cluster, and "dead" otherwise. A heartbeat program as a whole consists of various subsystems:

* Heartbeat Subsystem (HS): The subsystem that monitors the node's presence within the cluster through a series of keepalive or "heartbeat" messages.
* Cluster Manager (CM): The subsystem within the cluster, usually on the master server, which keeps track of the "cluster members" and records which resources are on which nodes.
* Cluster Transition (CT): When a node joins or leaves the cluster, this subsystem is responsible for keeping track of such occurrences in order to trigger rebalancing events and reconfigure the master to distribute the load.

Heartbeat messages are sent periodically through techniques such as broadcast or multicast in larger clusters. Since CMs have transactions across the cluster, the most common pattern is to send heartbeat messages to all the nodes and "await" responses in a non-blocking fashion. Since heartbeat or keepalive messages make up the overwhelming majority of non-application-related cluster control messages, and also go to all the members of the cluster, major critical systems also include non-IP channels such as serial ports to deliver heartbeats.
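The partitioning concern raised in this section can be sketched in Python. The text does not say how ownership is decided during a partition; the majority-quorum rule below is one common approach and is used here purely as an assumption. The function names and the "lowest node id wins" election rule are likewise hypothetical.

```python
def visible_members(reachable_peers, cluster_members, self_id):
    """Members this node can currently see, including itself."""
    return (set(reachable_peers) & set(cluster_members)) | {self_id}


def has_quorum(reachable_peers, cluster_members, self_id):
    """True if this node's partition holds a strict majority of the cluster."""
    visible = visible_members(reachable_peers, cluster_members, self_id)
    return len(visible) > len(cluster_members) // 2


def elect_owner(reachable_peers, cluster_members, self_id):
    """Pick the resource owner as seen from this node.

    Assumed rule: only a majority partition may own the resource, and
    within it the lowest node id wins. Returns None without quorum.
    """
    if not has_quorum(reachable_peers, cluster_members, self_id):
        return None
    return min(visible_members(reachable_peers, cluster_members, self_id))
```

Under this rule, a five-node cluster split three against two leaves exactly one owner, on the three-node side. An even split of a four-node cluster leaves no owner at all, which is the price of guaranteeing that the resource is never held in both partitions.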


Design and implementation

Every CM on the master server maintains a finite-state machine with three states for each node it administers: Down, Init, and Alive. Whenever a new node joins, the CM changes the state of the node from Down to Init and broadcasts a "boot-up message"; on receiving it, the node executes a set of start-up procedures. The node then responds with an acknowledgment message, after which the CM includes the node as a member of the cluster and transitions its state from Init to Alive. Every node in the Alive state receives a periodic broadcast heartbeat message from the HS subsystem, which expects an acknowledgment message back within a timeout range. If the CM does not receive an acknowledgment back, the node is considered unavailable, and the CM transitions that node's state from Alive to Down. The procedures or scripts to run, and the actions to take on each state transition, are implementation details of the system.
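The three-state machine described above can be sketched in Python as a transition table. The event names (`join`, `boot_ack`, `heartbeat_ack`, `ack_timeout`) and the `ClusterManager` class are hypothetical labels for the steps in the text. Ignoring events that have no entry in the table is also an assumption, since the text does not say what happens in those cases.

```python
from enum import Enum


class NodeState(Enum):
    DOWN = "Down"
    INIT = "Init"
    ALIVE = "Alive"


# Allowed transitions, keyed by (current state, event).
TRANSITIONS = {
    (NodeState.DOWN, "join"): NodeState.INIT,            # boot-up message sent
    (NodeState.INIT, "boot_ack"): NodeState.ALIVE,       # node acknowledged
    (NodeState.ALIVE, "heartbeat_ack"): NodeState.ALIVE, # still healthy
    (NodeState.ALIVE, "ack_timeout"): NodeState.DOWN,    # no ack in time
}


class ClusterManager:
    """Keeps one state per node; unknown nodes start in Down."""

    def __init__(self):
        self.states = {}

    def state_of(self, node):
        return self.states.get(node, NodeState.DOWN)

    def handle(self, node, event):
        """Apply an event; events with no table entry leave the state as is."""
        current = self.state_of(node)
        self.states[node] = TRANSITIONS.get((current, event), current)
        return self.states[node]

    def members(self):
        """Nodes currently counted as cluster members (Alive)."""
        return sorted(n for n, s in self.states.items() if s is NodeState.ALIVE)
```

Keeping the transitions in a table makes it straightforward to attach the implementation-specific scripts or actions to each `(state, event)` pair.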


Heartbeat network

A heartbeat network is a private network which is shared only by the nodes in the cluster and is not accessible from outside the cluster. Cluster nodes use it to monitor each node's status and to exchange the messages necessary for maintaining the operation of the cluster. The heartbeat method relies on the FIFO nature of the signals sent across the network. By making sure that all messages have been received, the system ensures that events can be properly ordered. In this communications protocol, every node sends back a message at a given interval, say delta, in effect confirming that it is alive and has a heartbeat. These messages are viewed as control messages that help determine that the network includes no delayed messages. A receiver node, called a "sync", maintains an ordered list of the received messages. Once a message with a timestamp later than the given marked time is received from every node, the system determines that all messages have been received, since the FIFO property ensures that the messages are ordered. In general, it is difficult to select a delta that is optimal for all applications. If delta is too small, it incurs too much overhead; if it is too large, performance degrades as everything waits for the next heartbeat signal.
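The completeness check performed by the "sync" receiver can be sketched in Python. The class and method names are hypothetical, and the sketch assumes each node's channel is FIFO with non-decreasing timestamps, as the text describes. A heartbeat is modelled as a message with no payload.

```python
class SyncReceiver:
    """Collects messages and decides when everything up to a time has arrived.

    Assumes FIFO delivery per node: once a node has sent a message stamped
    later than T, no message from it stamped at or before T is still in flight.
    """

    def __init__(self, nodes):
        self.latest = {node: None for node in nodes}  # newest timestamp per node
        self.messages = []                            # (timestamp, node, payload)

    def receive(self, node, timestamp, payload=None):
        """Record a message; a heartbeat is a message with payload None."""
        self.latest[node] = timestamp
        self.messages.append((timestamp, node, payload))

    def complete_up_to(self, marked_time):
        """True once every node has sent something later than marked_time."""
        return all(
            ts is not None and ts > marked_time
            for ts in self.latest.values()
        )

    def ordered_before(self, marked_time):
        """Messages stamped at or before marked_time, in timestamp order."""
        selected = [m for m in self.messages if m[0] <= marked_time]
        return sorted(selected, key=lambda m: (m[0], m[1]))
```

The sketch also shows the delta trade-off: `complete_up_to()` cannot return true until the slowest node's next message or heartbeat arrives, so a large delta delays every consumer of the ordered list.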


See also

* Watchdog timer, an electronic timer that is used to detect and recover from computer malfunctions
* Heartbleed vulnerability

