History
The topic predates the protocol. In 1988,Assumptions
In order to simplify the presentation of Paxos, the following assumptions and definitions are made explicit. Techniques to broaden the applicability are known in the literature, and are not covered in this article.Processors
* Processors operate at arbitrary speed. * Processors may experience failures. * Processors with stable storage may re-join the protocol after failures (following a crash-recovery failure model). * Processors do not collude, lie, or otherwise attempt to subvert the protocol. (That is, Byzantine failures don't occur. See Byzantine Paxos for a solution that tolerates failures that arise from arbitrary/malicious behavior of the processes.)Network
* Processors can send messages to any other processor. * Messages are sent asynchronously and may take arbitrarily long to deliver. * Messages may be lost, reordered, or duplicated. * Messages are delivered without corruption. (That is, Byzantine failures don't occur. See Byzantine Paxos for a solution which tolerates corrupted messages that arise from arbitrary/malicious behavior of the messaging channels.)Number of processors
In general, a consensus algorithm can make progress using processors, despite the simultaneous failure of any processors: in other words, the number of non-faulty processes must be strictly greater than the number of faulty processes. However, using reconfiguration, a protocol may be employed which survives any number of total failures as long as no more than F fail simultaneously. For Paxos protocols, these reconfigurations can be handled as separate ''configurations''.Roles
Paxos describes the actions of the processors by their roles in the protocol: client, acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time. This does not affect the correctness of the protocol—it is usual to coalesce roles to improve the latency and/or number of messages in the protocol. ; Client: The Client issues a ''request'' to the distributed system, and waits for a ''response''. For instance, a write request on a file in a distributed file server. ; Acceptor (Voters): The Acceptors act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum. ; Proposer: A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur. ; Learner: Learners act as the replication factor for the protocol. Once a Client request has been agreed upon by the Acceptors, the Learner may take action (i.e.: execute the request and send a response to the client). To improve availability of processing, additional Learners can be added. ; Leader: Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved in that case.Quorums
Quorums express the safety (or consistency) properties of Paxos by ensuring at least some surviving processor retains knowledge of the results. Quorums are defined as subsets of the set of Acceptors such that any two Quorums share at least one member. Typically, a Quorum is any majority of participating Acceptors. For example, given the set of Acceptors , a majority Quorum would be any three Acceptors: , , , . More generally, arbitrary positive weights can be assigned to Acceptors; in that case, a Quorum can be defined as any subset of Acceptors with the summary weight greater than half of the total weight of all Acceptors.Proposal number and agreed value
Each attempt to define an agreed value ''v'' is performed with proposals which may or may not be accepted by Acceptors. Each proposal is uniquely numbered for a given Proposer. So, e.g., each proposal may be of the form ''(n, v)'', where ''n'' is the unique identifier of the proposal and ''v'' is the actual proposed value. The value corresponding to a numbered proposal can be computed as part of running the Paxos protocol, but need not be.Safety and liveness properties
In order to guarantee ''safety'' (also called "consistency"), Paxos defines three properties and ensures the first two are always held, regardless of the pattern of failures: ; Validity (or ''non-triviality''): Only proposed values can be chosen and learned. ; Agreement (or ''consistency'', or ''safety''): No two distinct learners can learn different values (or there can't be more than one decided value) ; Termination (or liveness): If value C has been proposed, then eventually learner L will learn some value (if sufficient processors remain non-faulty). Note that Paxos is ''not'' guaranteed to terminate, and thus does not have the liveness property. This is supported by the Fischer Lynch Paterson impossibility result (FLP) which states that a consistency protocol can only have two of ''safety'', ''liveness'', and ''fault tolerance''. As Paxos's point is to ensure fault tolerance and it guarantees safety, it cannot also guarantee liveness.Typical deployment
In most deployments of Paxos, each participating process acts in three roles; Proposer, Acceptor and Learner. This reduces the message complexity significantly, without sacrificing correctness: By merging roles, the protocol "collapses" into an efficient client-master-replica style deployment, typical of the database community. The benefit of the Paxos protocols (including implementations with merged roles) is the guarantee of its safety properties. A typical implementation's message flow is covered in the section Multi-Paxos.Basic Paxos
This protocol is the most basic of the Paxos family. Each "instance" (or "execution") of the basic Paxos protocol decides on a single output value. The protocol proceeds over several rounds. A successful round has 2 phases: phase 1 (which is divided into parts ''a'' and ''b'') and phase 2 (which is divided into parts ''a'' and ''b''). See below the description of the phases. Remember that we assume an asynchronous model, so e.g. a processor may be in one phase while another processor may be in another.Phase 1
Phase 1a: ''Prepare''
: A Proposer creates a message, which we call a "Prepare", identified with a number ''n''. Note that ''n'' is not the value to be proposed and maybe agreed on, but just a number which uniquely identifies this initial message by the proposer (to be sent to the acceptors). The number ''n'' must be greater than any number used in any of the previous ''Prepare'' messages by this Proposer. Then, it sends the ''Prepare'' message containing ''n'' to at least aPhase 1b: ''Promise''
: Any of the ''Acceptors'' waits for a ''Prepare'' message from any of the ''Proposers''. If an Acceptor receives a Prepare message, the Acceptor must look at the identifier number ''n'' of the just received ''Prepare'' message. There are two cases. ::If ''n'' is higher than every previous proposal number received, from any of the Proposers, by the Acceptor, then the Acceptor must return a message, which we call a "Promise", to the Proposer, to ignore all future proposals having a number less than ''n''. If the Acceptor ''accepted'' a proposal at some point in the past, it must include the previous proposal number, say ''m'', and the corresponding accepted value, say ''w'', in its response to the Proposer. :: Otherwise (that is, ''n'' is less than or equal to any previous proposal number received from any Proposer by the Acceptor) the Acceptor can ignore the received proposal. It does not have to answer in this case for Paxos to work. However, for the sake of optimization, sending a denial (''Phase 2
Phase 2a: ''Accept''
: If a Proposer receives Promises from a Quorum of Acceptors, it needs to set a value ''v'' to its proposal. If any Acceptors had previously accepted any proposal, then they'll have sent their values to the Proposer, who now must set the value of its proposal, ''v'', to the value associated with the highest proposal number reported by the Acceptors, let's call it ''z''. If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose the value it originally wanted to propose, say ''x''. : The Proposer sends an ''Accept'' message, ''(n, v)'', to a Quorum of Acceptors with the chosen value for its proposal, v, and the proposal number ''n'' (which is the same as the number contained in the ''Prepare'' message previously sent to the Acceptors). So, the ''Accept'' message is either ''(n, v=z)'' or, in case none of the Acceptors previously accepted a value, ''(n, v=x)''. This ''Accept'' message should be interpreted as a "request", as in "Accept this proposal, please!".Phase 2b: ''Accepted''
: If an Acceptor receives an Accept message, ''(n, v)'', from a Proposer, it must accept itWhen rounds fail
: Rounds fail when multiple Proposers send conflicting ''Prepare'' messages, or when the Proposer does not receive a Quorum of responses (''Promise'' or ''Accepted''). In these cases, another round must be started with a higher proposal number.Paxos can be used to select a leader
Notice that a Proposer in Paxos could propose "I am the leader," (or, for example, "Proposer X is the leader"). Because of the agreement and validity guarantees of Paxos, if accepted by a Quorum, then the Proposer is now known to be the leader to all other nodes. This satisfies the needs of leader election because there is a single node believing it is the leader and a single node known to be the leader at all times.Graphic representation of the flow of messages in the basic Paxos
The following diagrams represent several cases/situations of the application of the Basic Paxos protocol. Some cases show how the Basic Paxos protocol copes with the failure of certain (redundant) components of the distributed system. Note that the values returned in the ''Promise'' message are "null" the first time a proposal is made (since no Acceptor has accepted a value before in this round).Basic Paxos without failures
In the diagram below, there is 1 Client, 1 Proposer, 3 Acceptors (i.e. the Quorum size is 3) and 2 Learners (represented by the 2 vertical lines). This diagram represents the case of a first round, which is successful (i.e. no process in the network fails).Error cases in basic Paxos
The simplest error cases are the failure of an Acceptor (when a Quorum of Acceptors remains alive) and failure of a redundant Learner. In these cases, the protocol requires no "recovery" (i.e. it still succeeds): no additional rounds or messages are required, as shown below (in the next two diagrams/cases).Basic Paxos when an Acceptor fails
In the following diagram, one of the Acceptors in the Quorum fails, so the Quorum size becomes 2. In this case, the Basic Paxos protocol still succeeds.Client Proposer Acceptor Learner , , , , , , , X-------->, , , , , , Request , X--------->, ->, ->, , , Prepare(1) , , , , ! , , !! FAIL !! , , <---------X--X , , Promise(1,) , X--------->, ->, , , Accept!(1,V) , , <---------X--X--------->, ->, Accepted(1,V) , <---------------------------------X--X Response , , , , , ,
Basic Paxos when a redundant learner fails
In the following case, one of the (redundant) Learners fails, but the Basic Paxos protocol still succeeds.Client Proposer Acceptor Learner , , , , , , , X-------->, , , , , , Request , X--------->, ->, ->, , , Prepare(1) , , <---------X--X--X , , Promise(1,) , X--------->, ->, ->, , , Accept!(1,V) , , <---------X--X--X------>, ->, Accepted(1,V) , , , , , , ! !! FAIL !! , <---------------------------------X Response , , , , , ,
Basic Paxos when a Proposer fails
In this case, a Proposer fails after proposing a value, but before the agreement is reached. Specifically, it fails in the middle of the Accept message, so only one Acceptor of the Quorum receives the value. Meanwhile, a new Leader (a Proposer) is elected (but this is not shown in detail). Note that there are 2 rounds in this case (rounds proceed vertically, from the top to the bottom).Client Proposer Acceptor Learner , , , , , , , X----->, , , , , , Request , X------------>, ->, ->, , , Prepare(1) , , <------------X--X--X , , Promise(1,) , , , , , , , , , , , , , , !! Leader fails during broadcast !! , X------------>, , , , , Accept!(1,V) , ! , , , , , , , , , , , , !! NEW LEADER !! , X--------->, ->, ->, , , Prepare(2) , , <---------X--X--X , , Promise(2,) , X--------->, ->, ->, , , Accept!(2,V) , , <---------X--X--X------>, ->, Accepted(2,V) , <---------------------------------X--X Response , , , , , , ,
Basic Paxos when multiple Proposers conflict
The most complex case is when multiple Proposers believe themselves to be Leaders. For instance, the current leader may fail and later recover, but the other Proposers have already re-selected a new leader. The recovered leader has not learned this yet and attempts to begin one round in conflict with the current leader. In the diagram below, 4 unsuccessful rounds are shown, but there could be more (as suggested at the bottom of the diagram).Client Proposer Acceptor Learner , , , , , , , X----->, , , , , , Request , X------------>, ->, ->, , , Prepare(1) , , <------------X--X--X , , Promise(1,) , ! , , , , , !! LEADER FAILS , , , , , , , !! NEW LEADER (knows last number was 1) , X--------->, ->, ->, , , Prepare(2) , , <---------X--X--X , , Promise(2,) , , , , , , , , !! OLD LEADER recovers , , , , , , , , !! OLD LEADER tries 2, denied , X------------>, ->, ->, , , Prepare(2) , , <------------X--X--X , , Nack(2) , , , , , , , , !! OLD LEADER tries 3 , X------------>, ->, ->, , , Prepare(3) , , <------------X--X--X , , Promise(3,) , , , , , , , , !! NEW LEADER proposes, denied , , X--------->, ->, ->, , , Accept!(2,Va) , , , <---------X--X--X , , Nack(3) , , , , , , , , !! NEW LEADER tries 4 , , X--------->, ->, ->, , , Prepare(4) , , , <---------X--X--X , , Promise(4,) , , , , , , , , !! OLD LEADER proposes, denied , X------------>, ->, ->, , , Accept!(3,Vb) , , <------------X--X--X , , Nack(4) , , , , , , , , ... and so on ...
Basic Paxos where an Acceptor accepts Two Different Values
In the following case, one Proposer achieves acceptance of value V1 by one Acceptor before failing. A new Proposer prepares the Acceptors that never accepted V1, allowing it to propose V2. Then V2 is accepted by all Acceptors, including the one that initially accepted V1.Proposer Acceptor Learner , , , , , , , X--------->, ->, ->, , , Prepare(1) , <---------X--X--X , , Promise(1,) x--------->, , , , , Accept!(1,V1) , , X------------>, ->, Accepted(1,V1) ! , , , , , , !! FAIL !! , , , , , , X--------->, ->, , , Prepare(2) , <---------X--X , , Promise(2,) X------>, ->, ->, , , Accept!(2,V2) , <------X--X--X------>, ->, Accepted(2,V2) , , , , , ,
Basic Paxos where a multi-identifier majority is insufficient
In the following case, one Proposer achieves acceptance of value V1 of one Acceptor before failing. A new Proposer prepares the Acceptors that never accepted V1, allowing it to propose V2. This Proposer is able to get one Acceptor to accept V2 before failing. A new Proposer finds a majority that includes the Acceptor that has accepted V1, and must propose it. The Proposer manages to get two Acceptors to accept it before failing. At this point, three Acceptors have accepted V1, but not for the same identifier. Finally, a new Proposer prepares the majority that has not seen the largest accepted identifier. The value associated with the largest identifier in that majority is V2, so it must propose it. This Proposer then gets all Acceptors to accept V2, achieving consensus.Proposer Acceptor Learner , , , , , , , , , , , X--------------->, ->, ->, ->, ->, , , Prepare(1) , <---------------X--X--X--X--X , , Promise(1,) x--------------->, , , , , , , Accept!(1,V1) , , , , X------------------>, ->, Accepted(1,V1) ! , , , , , , , , , , !! FAIL !! , , , , , , , , , , X--------------->, ->, ->, ->, , , Prepare(2) , <---------------X--X--X--X , , Promise(2,) X--------------->, , , , , , Accept!(2,V2) , , , , X--------------->, ->, Accepted(2,V2) ! , , , , , , , , , !! FAIL !! , , , , , , , , , X--------->, ---->, ->, ->, , , Prepare(3) , <---------X-----X--X--X , , Promise(3,) X--------------->, ->, , , , Accept!(3,V1) , , , , X--X--------->, ->, Accepted(3,V1) ! , , , , , , , , !! FAIL !! , , , , , , , , X------>, ->, ------->, , , Prepare(4) , <------X--X--, --, --X , , Promise(4,) X------>, ->, ->, ->, ->, , , Accept!(4,V2) , X--X--X--X--X------>, ->, Accepted(4,V2)
Basic Paxos where new Proposers cannot change an existing consensus
In the following case, one Proposer achieves acceptance of value V1 of two Acceptors before failing. A new Proposer may start another round, but it is now impossible for that proposer to prepare a majority that doesn't include at least one Acceptor that has accepted V1. As such, even though the Proposer doesn't see the existing consensus, the Proposer's only option is to propose the value already agreed upon. New Proposers can continually increase the identifier to restart the process, but the consensus can never be changed.Proposer Acceptor Learner , , , , , , , X--------->, ->, ->, , , Prepare(1) , <---------X--X--X , , Promise(1,) x--------->, ->, , , , Accept!(1,V1) , , X--X--------->, ->, Accepted(1,V1) ! , , , , , , !! FAIL !! , , , , , , X--------->, ->, , , Prepare(2) , <---------X--X , , Promise(2,) X------>, ->, ->, , , Accept!(2,V1) , <------X--X--X------>, ->, Accepted(2,V1) , , , , , ,
Multi-Paxos
A typical deployment of Paxos requires a continuous stream of agreed values acting as commands to a distributed state machine. If each command is the result of a single instance of theGraphic representation of the flow of messages in the Multi-Paxos
Multi-Paxos without failures
In the following diagram, only one instance (or "execution") of the basic Paxos protocol, with an initial Leader (a Proposer), is shown. Note that a Multi-Paxos consists of several instances of the basic Paxos protocol.Client Proposer Acceptor Learner , , , , , , , --- First Request --- X-------->, , , , , , Request , X--------->, ->, ->, , , Prepare(N) , , <---------X--X--X , , Promise(N,I,) , X--------->, ->, ->, , , Accept!(N,I,V) , , <---------X--X--X------>, ->, Accepted(N,I,V) , <---------------------------------X--X Response , , , , , , ,where V = last of (Va, Vb, Vc).
Multi-Paxos when phase 1 can be skipped
In this case, subsequent instances of the basic Paxos protocol (represented by ''I+1'') use the same leader, so the phase 1 (of these subsequent instances of the basic Paxos protocol), which consist of the Prepare and Promise sub-phases, is skipped. Note that the Leader should be stable, i.e. it should not crash or change.Client Proposer Acceptor Learner , , , , , , , --- Following Requests --- X-------->, , , , , , Request , X--------->, ->, ->, , , Accept!(N,I+1,W) , , <---------X--X--X------>, ->, Accepted(N,I+1,W) , <---------------------------------X--X Response , , , , , , ,
Multi-Paxos when roles are collapsed
A common deployment of the Multi-Paxos consists in collapsing the role of the Proposers, Acceptors and Learners to "Servers". So, in the end, there are only "Clients" and "Servers". The following diagram represents the first "instance" of a basic Paxos protocol, when the roles of the Proposer, Acceptor and Learner are collapsed to a single role, called the "Server".Client Servers , , , , --- First Request --- X-------->, , , Request , X->, ->, Prepare(N) , , <-X--X Promise(N, I, ) , X->, ->, Accept!(N, I, Vn) , X<>X<>X Accepted(N, I) , <--------X , , Response , , , ,
Multi-Paxos when roles are collapsed and the leader is steady
In the subsequent instances of the basic Paxos protocol, with the same leader as in the previous instances of the basic Paxos protocol, the phase 1 can be skipped.Client Servers X-------->, , , Request , X->, ->, Accept!(N,I+1,W) , X<>X<>X Accepted(N,I+1) , <--------X , , Response , , , ,
Optimisations
A number of optimisations can be performed to reduce the number of exchanged messages, to improve the performance of the protocol, etc. A few of these optimisations are reported below. : "We can save messages at the cost of an extra message delay by having a single distinguished learner that informs the other learners when it finds out that a value has been chosen. Acceptors then send ''Accepted'' messages only to the distinguished learner. In most applications, the roles of leader and distinguished learner are performed by the same processor. : "A leader can send its ''Prepare'' and ''Accept!'' messages just to a quorum of acceptors. As long as all acceptors in that quorum are working and can communicate with the leader and the learners, there is no need for acceptors not in the quorum to do anything. : "Acceptors do not care what value is chosen. They simply respond to ''Prepare'' and ''Accept!'' messages to ensure that, despite failures, only a single value can be chosen. However, if an acceptor does learn what value has been chosen, it can store the value in stable storage and erase any other information it has saved there. If the acceptor later receives a ''Prepare'' or ''Accept!'' message, instead of performing its Phase1b or Phase2b action, it can simply inform the leader of the chosen value. : "Instead of sending the value v, the leader can send a hash of v to some acceptors in its ''Accept!'' messages. A learner will learn that v is chosen if it receives ''Accepted'' messages for either v or its hash from a quorum of acceptors, and at least one of those messages contains v rather than its hash. However, a leader could receive ''Promise'' messages that tell it the hash of a value v that it must use in its Phase2a action without telling it the actual value of v. If that happens, the leader cannot execute its Phase2a action until it communicates with some process that knows v." : "A proposer can send its proposal only to the leader rather than to all coordinators. However, this requires that the result of the leader-selection algorithm be broadcast to the proposers, which might be expensive. So, it might be better to let the proposer send its proposal to all coordinators. (In that case, only the coordinators themselves need to know who the leader is.) : "Instead of each acceptor sending ''Accepted'' messages to each learner, acceptors can send their ''Accepted'' messages to the leader and the leader can inform the learners when a value has been chosen. However, this adds an extra message delay. : "Finally, observe that phase 1 is unnecessary for round 1 .. The leader of round 1 can begin the round by sending an ''Accept!'' message with any proposed value."Cheap Paxos
Cheap Paxos extends"With only two processors p and q, one processor cannot distinguish failure of the other processor from failure of the communication medium. A third processor is needed. However, that third processor does not have to participate in choosing the sequence of commands. It must take action only in case p or q fails, after which it does nothing while either p or q continues to operate the system by itself. The third processor can therefore be a small/slow/cheap one, or a processor primarily devoted to other tasks."
Message flow: Cheap Multi-Paxos
An example involving three main acceptors, one auxiliary acceptor and quorum size of three, showing failure of one main processor and subsequent reconfiguration:Proposer Main Aux Learner , , , , , , -- Phase 2 -- X----------->, ->, ->, , , Accept!(N,I,V) , , , ! , , --- FAIL! --- , <-----------X--X--------------->, Accepted(N,I,V) , , , , , -- Failure detected (only 2 accepted) -- X----------->, ->, ------->, , Accept!(N,I,V) (re-transmit, include Aux) , <-----------X--X--------X------>, Accepted(N,I,V) , , , , , -- Reconfigure : Quorum = 2 -- X----------->, ->, , , Accept!(N,I+1,W) (Aux not participating) , <-----------X--X--------------->, Accepted(N,I+1,W) , , , , ,
Fast Paxos
Fast Paxos generalizesMessage flow: Fast Paxos, non-conflicting
Client Leader Acceptor Learner , , , , , , , , , X--------->, ->, ->, ->, , , Any(N,I,Recovery) , , , , , , , , X------------------->, ->, ->, ->, , , Accept!(N,I,W) , , <---------X--X--X--X------>, ->, Accepted(N,I,W) , <------------------------------------X--X Response(W) , , , , , , , ,
Message flow: Fast Paxos, conflicting proposals
Conflicting proposals with coordinated recovery. Note: the protocol does not specify how to handle the dropped client request.Client Leader Acceptor Learner , , , , , , , , , , , , , , , , , , , , , , , , , , , !! Concurrent conflicting proposals , , , , , , , , , !! received in different order , , , , , , , , , !! by the Acceptors , X--------------?, -?, -?, -?, , , Accept!(N,I,V) X-----------------?, -?, -?, -?, , , Accept!(N,I,W) , , , , , , , , , , , , , , , , , , !! Acceptors disagree on value , , , <-------X--X->, ->, ----->, ->, Accepted(N,I,V) , , , <-------, <-, <-X--X----->, ->, Accepted(N,I,W) , , , , , , , , , , , , , , , , , , !! Detect collision & recover , , X------->, ->, ->, ->, , , Accept!(N+1,I,W) , , , <-------X--X--X--X----->, ->, Accepted(N+1,I,W) , <---------------------------------X--X Response(W) , , , , , , , , ,Conflicting proposals with uncoordinated recovery.
Client Leader Acceptor Learner , , , , , , , , , , , X------->, ->, ->, ->, , , Any(N,I,Recovery) , , , , , , , , , , , , , , , , , , !! Concurrent conflicting proposals , , , , , , , , , !! received in different order , , , , , , , , , !! by the Acceptors , X--------------?, -?, -?, -?, , , Accept!(N,I,V) X-----------------?, -?, -?, -?, , , Accept!(N,I,W) , , , , , , , , , , , , , , , , , , !! Acceptors disagree on value , , , <-------X--X->, ->, ----->, ->, Accepted(N,I,V) , , , <-------, <-, <-X--X----->, ->, Accepted(N,I,W) , , , , , , , , , , , , , , , , , , !! Detect collision & recover , , , <-------X--X--X--X----->, ->, Accepted(N+1,I,W) , <---------------------------------X--X Response(W) , , , , , , , , ,
Message flow: Fast Paxos with uncoordinated recovery, collapsed roles
(merged Acceptor/Learner roles)Client Servers , , , , , , , , X->, ->, ->, Any(N,I,Recovery) , , , , , , , , , , , , !! Concurrent conflicting proposals , , , , , , !! received in different order , , , , , , !! by the Servers , X--------?, -?, -?, -?, Accept!(N,I,V) X-----------?, -?, -?, -?, Accept!(N,I,W) , , , , , , , , , , , , !! Servers disagree on value , , X<>X->, ->, Accepted(N,I,V) , , , <-, <-X<>X Accepted(N,I,W) , , , , , , , , , , , , !! Detect collision & recover , , X<>X<>X<>X Accepted(N+1,I,W) , <-----------X--X--X--X Response(W) , , , , , ,
Generalized Paxos
Generalized consensus explores the relationship between the operations of the replicated state machine and the consensus protocol that implements it. The main discovery involves optimizations of Paxos when conflicting proposals could be applied in any order. i.e., when the proposed operations are commutative operations for the state machine. In such cases, the conflicting operations can both be accepted, avoiding the delays required for resolving conflicts and re-proposing the rejected operations. This concept is further generalized into ever-growing sequences of commutative operations, some of which are known to be stable (and thus may be executed). The protocol tracks these sequences ensuring that all proposed operations of one sequence are stabilized before allowing any operation non-commuting with them to become stable.Example
In order to illustrate Generalized Paxos, the example below shows a message flow between two concurrently executing clients and a replicated state machine implementing read/write operations over two distinct registers A and B. Note that<1:Read(A), 2:Read(B), 3:Write(B), 4:Read(B), 5:Read(A), 6:Write(A)>Since
5:Read(A)
commutes with both 3:Write(B)
and 4:Read(B)
, one possible permutation equivalent to the previous order is the following:
<1:Read(A), 2:Read(B), 5:Read(A), 3:Write(B), 4:Read(B), 6:Write(A)>In practice, a commute occurs only when operations are proposed concurrently.
Message flow: Generalized Paxos (example)
Responses not shown. Note: message abbreviations differ from previous message flows due to specifics of the protocol, see for a full discussion.Client Leader Acceptor Learner , , , , , , , , !! New Leader Begins Round , , X----->, ->, ->, , , Prepare(N) , , , <-----X- X- X , , Promise(N,null) , , X----->, ->, ->, , , Phase2Start(N,null) , , , , , , , , , , , , , , , , !! Concurrent commuting proposals , X------- ?, -----?, -?, -?, , , Propose(ReadA) X-----------?, -----?, -?, -?, , , Propose(ReadB) , , X------X-------------->, ->, Accepted(N,) , , , <--------X--X-------->, ->, Accepted(N, ) , , , , , , , , , , , , , , , , !! No Conflict, both accepted , , , , , , , , Stable = , , , , , , , , , , , , , , , , !! Concurrent conflicting proposals X-----------?, -----?, -?, -?, , , Propose( ) , X--------?, -----?, -?, -?, , , Propose(ReadB) , , , , , , , , , , X------X-------------->, ->, Accepted(N, . ) , , , <--------X--X-------->, ->, Accepted(N, . ) , , , , , , , , , , , , , , , , !! Conflict detected, leader chooses , , , , , , , , commutative order: , , , , , , , , V = , , , , , , , , , , X----->, ->, ->, , , Phase2Start(N+1,V) , , , <-----X- X- X-------->, ->, Accepted(N+1,V) , , , , , , , , Stable = . , , , , , , , , , , , , , , , , , , , , , , , , !! More conflicting proposals X-----------?, -----?, -?, -?, , , Propose(WriteA) , X--------?, -----?, -?, -?, , , Propose(ReadA) , , , , , , , , , , X------X-------------->, ->, Accepted(N+1, . ) , , , <--------X- X-------->, ->, Accepted(N+1, . ) , , , , , , , , , , , , , , , , !! Leader chooses order: , , , , , , , , W = , , , , , , , , , , X----->, ->, ->, , , Phase2Start(N+2,W) , , , <-----X- X- X-------->, ->, Accepted(N+2,W) , , , , , , , , Stable = . , , , , , , , , . , , , , , , , , , , , , , , , ,
Performance
The above message flow shows us that Generalized Paxos can leverage operation semantics to avoid collisions when the spontaneous ordering of the network fails. This allows the protocol to be in practice quicker than Fast Paxos. However, when a collision occurs, Generalized Paxos needs two additional round trips to recover. This situation is illustrated with operations WriteB and ReadB in the above schema. In the general case, such round trips are unavoidable and come from the fact that multiple commands can be accepted during a round. This makes the protocol more expensive than Paxos when conflicts are frequent. Hopefully two possible refinements of Generalized Paxos are possible to improve recovery time. * First, if the coordinator is part of every quorum of acceptors (round N is said ''centered''), then to recover at round N+1 from a collision at round N, the coordinator skips phase 1 and proposes at phase 2 the sequence it accepted last during round N. This reduces the cost of recovery to a single round trip. * Second, if both rounds N and N+1 use a unique and identical centered quorum, when an acceptor detects a collision at round N, it spontaneously proposes at round N+1 a sequence suffixing both (i) the sequence accepted at round N by the coordinator and (ii) the greatest non-conflicting prefix it accepted at round N. For instance, if the coordinator and the acceptor accepted respectively at round NByzantine Paxos
Paxos may also be extended to support arbitrary failures of the participants, including lying, fabrication of messages, collusion with other participants, selective non-participation, etc. These types of failures are called Byzantine failures, after the solution popularized by Lamport. Byzantine Paxos introduced by Castro and Liskov adds an extra message (Verify) which acts to distribute knowledge and verify the actions of the other processors:Message flow: Byzantine Multi-Paxos, steady state
Client Proposer Acceptor Learner , , , , , , , X-------->, , , , , , Request , X--------->, ->, ->, , , Accept!(N,I,V) , , X<>X<>X , , Verify(N,I,V) - BROADCAST , , <---------X--X--X------>, ->, Accepted(N,V) , <---------------------------------X--X Response(V) , , , , , , ,Fast Byzantine Paxos introduced by Martin and Alvisi removes this extra delay, since the client sends commands directly to the Acceptors. Note the ''Accepted'' message in Fast Byzantine Paxos is sent to all Acceptors and all Learners, while Fast Paxos sends ''Accepted'' messages only to Learners):
Message flow: Fast Byzantine Multi-Paxos, steady state
Client Acceptor Learner , , , , , , X----->, ->, ->, , , Accept!(N,I,V) , X<>X<>X------>, ->, Accepted(N,I,V) - BROADCAST , <-------------------X--X Response(V) , , , , , ,The failure scenario is the same for both protocols; Each Learner waits to receive F+1 identical messages from different Acceptors. If this does not occur, the Acceptors themselves will also be aware of it (since they exchanged each other's messages in the broadcast round), and correct Acceptors will re-broadcast the agreed value:
Message flow: Fast Byzantine Multi-Paxos, failure
Client Acceptor Learner , , , ! , , !! One Acceptor is faulty X----->, ->, ->! , , Accept!(N,I,V) , X<>X<>X------>, ->, Accepted(N,I,) - BROADCAST , , , ! , , !! Learners receive 2 different commands , , , ! , , !! Correct Acceptors notice error and choose , X<>X<>X------>, ->, Accepted(N,I,V) - BROADCAST , <-------------------X--X Response(V) , , , ! , ,
Adapting Paxos for RDMA networks
With the emergence of very high speed reliable datacenter networks that support remote DMA ( RDMA), there has been substantial interest in optimizing Paxos to leverage hardware offloading, in which the network interface card and network routers provide reliability and network-layer congestion control, freeing the host CPU for other tasks. ThProduction use of Paxos
* Google uses the Paxos algorithm in their Chubby distributed lock service in order to keep replicas consistent in case of failure. Chubby is used bySee also
*References
External links