Raft Consensus Algorithm
Raft is a consensus algorithm designed as an alternative to the Paxos family of algorithms. It was meant to be more understandable than Paxos by means of separation of logic, but it is also formally proven safe and offers some additional features. Raft offers a generic way to distribute a state machine across a cluster of computing systems, ensuring that each node in the cluster agrees upon the same series of state transitions. It has a number of open-source reference implementations, with full-specification implementations in Go, C++, Java, and Scala. It is named after Reliable, Replicated, Redundant, And Fault-Tolerant. Raft is not a Byzantine fault tolerant (BFT) algorithm; the nodes trust the elected leader.


Basics

Raft achieves consensus via an elected leader. A server in a Raft cluster is either a ''leader'' or a ''follower'', and can be a ''candidate'' in the special case of an election (when the leader is unavailable). The leader is responsible for log replication to the followers. It regularly informs the followers of its existence by sending a heartbeat message. Each follower has a timeout (typically between 150 and 300 ms) within which it expects a heartbeat from the leader. The timeout is reset on receiving the heartbeat. If no heartbeat is received, the follower changes its status to candidate and starts a leader election.
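To make the roles and the heartbeat-driven timeout concrete, the following Go sketch models a server's state transitions. It is a minimal illustration, not taken from any reference implementation; all names (Role, Server, resetElectionTimer) are hypothetical.

    package raft

    import (
        "math/rand"
        "time"
    )

    // Role is the state a server occupies in the cluster.
    type Role int

    const (
        Follower Role = iota
        Candidate
        Leader
    )

    // Server tracks the minimal state needed for leader maintenance.
    type Server struct {
        role          Role
        electionTimer *time.Timer
    }

    // resetElectionTimer is called whenever a heartbeat arrives. It restarts a
    // randomized timeout in the 150-300 ms range quoted above; if the timer
    // fires before the next heartbeat, the follower becomes a candidate.
    func (s *Server) resetElectionTimer() {
        d := time.Duration(150+rand.Intn(150)) * time.Millisecond
        if s.electionTimer != nil {
            s.electionTimer.Stop()
        }
        s.electionTimer = time.AfterFunc(d, func() {
            s.role = Candidate // a leader election would start here
        })
    }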


Approach to the consensus problem in Raft

Raft implements consensus using a leader-based approach. The cluster has one and only one elected leader, which is fully responsible for managing log replication on the other servers of the cluster. This means that the leader can decide where to place new entries and how data flows between it and the other servers without consulting them. A leader leads until it fails or disconnects, in which case the surviving servers elect a new leader. Raft decomposes the consensus problem into two relatively independent subproblems, described below.


Leader election

When the existing leader fails or when the algorithm initializes, a new leader needs to be elected. In this case, a new ''term'' starts in the cluster. A term is an arbitrary period of time during which a single leader is in charge; each term begins with a leader election. If the election completes successfully (i.e. a single leader is elected), the term continues with normal operations orchestrated by the new leader. If the election fails, a new term starts, with a new election.

A leader election is started by a ''candidate'' server. A server becomes a candidate if it receives no communication from the leader over a period called the ''election timeout'', from which it assumes there is no longer an acting leader. It starts the election by incrementing the term counter, voting for itself as the new leader, and sending a message to all other servers requesting their vote. A server votes only once per term, on a first-come-first-served basis. If a candidate receives a message from another server with a term number larger than the candidate's current term, the candidate's election is defeated: the candidate becomes a follower and recognizes that leader as legitimate. If a candidate receives a majority of votes, it becomes the new leader. If neither happens, e.g. because of a split vote, a new term starts and a new election begins.

Raft uses a randomized election timeout to ensure that split votes are resolved quickly. Randomization reduces the chance of a split vote because servers will not become candidates at the same time: a single server times out first, wins the election, and sends heartbeat messages to the other servers before any of the followers can become candidates.
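The vote-granting side of an election could be sketched as follows, continuing the hypothetical types above. The additional "log at least as up-to-date" check required for safety is deferred to the State machine safety section below.

    // Node extends the sketch above with term and voting state.
    type Node struct {
        currentTerm int
        votedFor    int // candidate ID voted for in currentTerm; -1 if none
        role        Role
    }

    type RequestVoteArgs struct {
        Term        int // candidate's term
        CandidateID int
    }

    type RequestVoteReply struct {
        Term        int
        VoteGranted bool
    }

    // HandleRequestVote grants at most one vote per term, first-come-first-served.
    func (n *Node) HandleRequestVote(args RequestVoteArgs) RequestVoteReply {
        // A higher term defeats any election in progress: revert to follower.
        if args.Term > n.currentTerm {
            n.currentTerm = args.Term
            n.votedFor = -1
            n.role = Follower
        }
        reply := RequestVoteReply{Term: n.currentTerm}
        // Reject stale candidates; otherwise vote at most once in this term.
        if args.Term == n.currentTerm &&
            (n.votedFor == -1 || n.votedFor == args.CandidateID) {
            n.votedFor = args.CandidateID
            reply.VoteGranted = true
        }
        return reply
    }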


Log replication

The leader is responsible for log replication. It accepts client requests, each of which consists of a command to be executed by the replicated state machines in the cluster. After appending a request to its log as a new entry, the leader forwards it to the followers in ''AppendEntries'' messages. If a follower is unavailable, the leader retries ''AppendEntries'' messages indefinitely, until the log entry is eventually stored by all of the followers. Once the leader receives confirmation from half or more of its followers that the entry has been replicated, the leader applies the entry to its local state machine, and the request is considered ''committed''. This event also commits all previous entries in the leader's log. Once a follower learns that a log entry is committed, it applies the entry to its local state machine. This keeps the logs consistent across all the servers in the cluster, ensuring that the safety rule of Log Matching is respected.

In the case of a leader crash, the logs can be left inconsistent, with some entries from the old leader not fully replicated across the cluster. The new leader handles the inconsistency by forcing the followers to duplicate its own log: for each of its followers, the leader compares its log with the follower's log, finds the last entry on which they agree, deletes all the follower's entries after that point, and replaces them with its own log entries. This mechanism restores log consistency in a cluster subject to failures.
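A follower-side sketch of ''AppendEntries'' shows how the consistency check and the overwrite of conflicting entries might work. Names are hypothetical; the sketch assumes the Node above also carries a log and a commit index, and uses 1-based log indices as in the Raft paper (min is Go 1.21's built-in).

    // LogEntry pairs a client command with the term in which it was received.
    type LogEntry struct {
        Term    int
        Command []byte
    }

    type AppendEntriesArgs struct {
        Term         int        // leader's term
        PrevLogIndex int        // index of the entry just before Entries
        PrevLogTerm  int        // term of that entry
        Entries      []LogEntry // entries to replicate (empty for heartbeat)
        LeaderCommit int        // leader's commit index
    }

    // Assumed extra fields on Node: log []LogEntry, commitIndex int.

    // HandleAppendEntries rejects the call unless the follower's log contains
    // the entry at PrevLogIndex with PrevLogTerm; on success it deletes any
    // conflicting suffix and appends the leader's entries, preserving Log Matching.
    func (n *Node) HandleAppendEntries(args AppendEntriesArgs) bool {
        if args.Term < n.currentTerm {
            return false // message from a stale leader
        }
        if args.PrevLogIndex > len(n.log) ||
            (args.PrevLogIndex > 0 && n.log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
            return false // logs diverge earlier; the leader retries with a smaller index
        }
        // Truncate the conflicting suffix and adopt the leader's entries.
        n.log = append(n.log[:args.PrevLogIndex], args.Entries...)
        // Entries up to the leader's commit index can now be applied locally.
        if args.LeaderCommit > n.commitIndex {
            n.commitIndex = min(args.LeaderCommit, len(n.log))
        }
        return true
    }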


Safety


Safety rules in Raft

Raft guarantees each of these safety properties:
* Election safety: at most one leader can be elected in a given term.
* Leader append-only: a leader can only append new entries to its log (it can neither overwrite nor delete entries).
* Log matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index (see the sketch after this list).
* Leader completeness: if a log entry is committed in a given term, then it will be present in the logs of the leaders of all subsequent terms.
* State machine safety: if a server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index.

The first four rules are guaranteed by the details of the algorithm described in the previous section. State machine safety is guaranteed by a restriction on the election process.
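The Log matching property can be phrased as a check one might run in tests. This sketch is hypothetical and reuses the LogEntry type from the sketch above.

    import "bytes"

    // logsMatch verifies the Log Matching property for two logs: if they share
    // an entry with the same index and term, all entries up through that index
    // must be identical. Returns true when the property holds.
    func logsMatch(a, b []LogEntry) bool {
        // Find the highest index (0-based here) where the terms agree.
        for i := min(len(a), len(b)) - 1; i >= 0; i-- {
            if a[i].Term != b[i].Term {
                continue
            }
            // The property requires the full prefix up to i to be identical.
            for j := 0; j <= i; j++ {
                if a[j].Term != b[j].Term || !bytes.Equal(a[j].Command, b[j].Command) {
                    return false
                }
            }
            return true
        }
        return true // no shared (index, term) pair, so the property holds vacuously
    }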


State machine safety

This rule is ensured by a simple restriction: a candidate cannot win an election unless its log contains all committed entries. In order to be elected, a candidate has to contact a majority of the cluster, and given the rules for entries to be committed, every committed entry must be present on at least one of the servers the candidate contacts. Raft determines which of two logs (carried by two distinct servers) is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, the log with the later term is more up-to-date. If the logs end with the same term, whichever log is longer is more up-to-date. In Raft, the request from a candidate to a voter includes information about the candidate's log; if the voter's own log is more up-to-date than the candidate's, the voter denies its vote to the candidate. This implementation ensures the State Machine Safety rule.
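The comparison just described can be written directly. In this hypothetical sketch, the candidate would send its last log index and term in the vote request, and this check would be added as an extra condition in HandleRequestVote above: the voter grants its vote only when the function returns true.

    // candidateLogOK reports whether the candidate's log is at least as
    // up-to-date as the voter's: a later last term wins, and equal last terms
    // are broken by log length. A voter denies its vote when this is false.
    func candidateLogOK(voterLog []LogEntry, lastLogIndex, lastLogTerm int) bool {
        myIndex := len(voterLog)
        myTerm := 0
        if myIndex > 0 {
            myTerm = voterLog[myIndex-1].Term
        }
        if lastLogTerm != myTerm {
            return lastLogTerm > myTerm
        }
        return lastLogIndex >= myIndex
    }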


Follower crashes

If a follower crashes, ''AppendEntries'' and ''RequestVote'' requests sent by other servers will fail. Such failures are handled by the servers retrying indefinitely until they reach the downed follower. If the follower restarts, the pending requests will complete. If a request had already been processed before the failure, the restarted follower simply ignores the duplicate.
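A leader-side retry loop might look like the following sketch. sendAppendEntries is a hypothetical RPC stub, not a real library call; the backoff interval is an arbitrary choice.

    // sendAppendEntries is a placeholder for the actual RPC; delivered is false
    // while the follower is down.
    func sendAppendEntries(addr string, args AppendEntriesArgs) (delivered, accepted bool) {
        return false, false // stub
    }

    // replicateTo retries indefinitely until the follower stores the entries.
    // Because a repeated AppendEntries with the same arguments rewrites the
    // same log suffix, a request replayed after a restart is harmless.
    func (n *Node) replicateTo(addr string, args AppendEntriesArgs) {
        for {
            if delivered, accepted := sendAppendEntries(addr, args); delivered && accepted {
                return
            }
            time.Sleep(50 * time.Millisecond) // arbitrary backoff before retrying
        }
    }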


Timing and availability

Timing is critical in Raft to elect and maintain a steady leader over time, and thus to preserve the availability of the cluster. Stability is ensured by respecting the ''timing requirement'' of the algorithm:
''broadcastTime << electionTimeout << MTBF''
* ''broadcastTime'' is the average time it takes a server to send a request to every server in the cluster and receive responses. It is relative to the infrastructure used.
* ''electionTimeout'' is the same as described in the Leader election section. It is something the programmer must choose.
* ''MTBF (Mean Time Between Failures)'' is the average time between failures for a server. It is also relative to the infrastructure.

Typical values are 0.5 ms to 20 ms for ''broadcastTime'', which implies that the programmer sets the ''electionTimeout'' somewhere between 10 ms and 500 ms. It can take several weeks or months between single-server failures, which means the values are sufficient for a stable cluster.
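As a small illustration, the rule of thumb could be checked when configuring a cluster. The factor of ten used here as the margin for "much less than" is an assumption, not part of the algorithm.

    // validTiming checks broadcastTime << electionTimeout << MTBF, reading
    // "much less than" as "at least ten times smaller" (an assumed margin).
    func validTiming(broadcast, election, mtbf time.Duration) bool {
        return 10*broadcast <= election && 10*election <= mtbf
    }

For example, with a 20 ms ''broadcastTime'' and a 300 ms ''electionTimeout'', any MTBF above 3 s passes the check; real server MTBFs are orders of magnitude larger.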


Extensions

The dissertation “Consensus: Bridging Theory and Practice” by one of the co-authors of the original paper describes extensions to the original algorithm:
* Pre-Vote: when a member rejoins the cluster, it can, depending on timing, trigger an election even though there is already a leader. To avoid this, a pre-vote phase first checks in with the other members (see the sketch after this list). Avoiding the unnecessary election improves the availability of the cluster, so this extension is usually present in production implementations.
* Leadership transfer: a leader that is shutting down in an orderly fashion can explicitly transfer leadership to another member. This can be faster than waiting for a timeout. A leader can also step down when another member would be a better leader, for example when that member is on a faster machine.
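A pre-vote round might be sketched as follows. askPreVote is a hypothetical stub: a peer would answer true only if it has not heard from a leader within its own election timeout and the caller's log is sufficiently up-to-date. Only if the pre-vote succeeds does the member increment its term and start a real election.

    // askPreVote is a placeholder for the pre-vote RPC to one peer.
    func askPreVote(addr string, proposedTerm, lastLogIndex, lastLogTerm int) bool {
        return false // stub
    }

    // preVote polls the cluster without disturbing it: no term is incremented
    // and no votes are recorded. The member proceeds to a real election only
    // if a majority signals that it could win.
    func (n *Node) preVote(peers []string) bool {
        lastIndex := len(n.log)
        lastTerm := 0
        if lastIndex > 0 {
            lastTerm = n.log[lastIndex-1].Term
        }
        granted := 1 // count ourselves
        for _, p := range peers {
            if askPreVote(p, n.currentTerm+1, lastIndex, lastTerm) {
                granted++
            }
        }
        return granted > (len(peers)+1)/2
    }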


Production use of Raft

* CockroachDB uses Raft in its replication layer.
* etcd uses Raft to manage a highly-available replicated log.
* Hazelcast uses Raft to provide its CP Subsystem, a strongly consistent layer for distributed data structures.
* MongoDB uses a variant of Raft in its replica sets.
* Neo4j uses Raft to ensure consistency and safety.
* RabbitMQ uses Raft to implement durable, replicated FIFO queues.
* ScyllaDB uses Raft for metadata (schema and topology changes).
* Splunk Enterprise uses Raft in a Search Head Cluster (SHC).
* TiDB uses Raft in its storage engine, TiKV.
* YugabyteDB uses Raft in DocDB replication.
* ClickHouse uses Raft for its in-house implementation of a ZooKeeper-like service.
* Redpanda uses the Raft consensus algorithm for data replication.
* Apache Kafka Raft (KRaft) uses Raft for metadata management.
* NATS Messaging uses the Raft consensus algorithm for JetStream cluster management and data replication.
* Camunda uses the Raft consensus algorithm for data replication.

