A Byzantine fault (also Byzantine generals problem, interactive consistency, source congruency, error avalanche, Byzantine agreement problem, and Byzantine failure) is a condition of a computer system, particularly

distributed computing A distributed system is a system whose components are located on different computer network, networked computers, which communicate and coordinate their actions by message passing, passing messages to one another from any system. Distributed com ...

systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an

allegory As a literary device or artistic form, an allegory is a narrative or visual representation in which a character, place, or event can be interpreted to represent a hidden meaning with moral or political significance. Authors have used allegory th ...

, the "Byzantine generals problem", developed to describe a situation in which, in order to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable. In a Byzantine fault, a component such as a

server Server may refer to: Computing *Server (computing), a computer program or a device that provides functionality for other programs or devices, called clients Role * Waiting staff, those who work at a restaurant or a bar attending customers and su ...

can inconsistently appear both failed and functioning to failure-detection systems, presenting different symptoms to different observers. It is difficult for the other components to declare it failed and shut it out of the network, because they need to first reach a consensus regarding which component has failed in the first place. Byzantine fault tolerance (BFT) is the resiliency of a

fault-tolerant computer system Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the ...

to such conditions.

Analogy

In its simplest form, a number of generals are attacking a fortress and they must decide as a group whether to attack or retreat. Some generals may prefer to attack, while others prefer to retreat. The important thing is that all generals agree on a common decision, for a halfhearted attack by a few generals would become a

rout A rout is a panicked, disorderly and undisciplined retreat of troops from a battlefield, following a collapse in a given unit's command authority, unit cohesion and combat morale (''esprit de corps''). History Historically, lightly-equi ...

, and would be worse than either a coordinated attack or a coordinated retreat. The problem is complicated by the presence of treacherous generals who may not only cast a vote for a suboptimal strategy, they may do so selectively. For instance, if nine generals are voting, four of whom support attacking while four others are in favor of retreat, the ninth general may send a vote of retreat to those generals in favor of retreat, and a vote of attack to the rest. Those who received a retreat vote from the ninth general will retreat, while the rest will attack (which may not go well for the attackers). The problem is complicated further by the generals being physically separated and having to send their votes via messengers who may fail to deliver votes or may forge false votes.

Resolution

Byzantine fault tolerance can be achieved if the loyal (non-faulty) generals have a majority agreement on their strategy. There can be a default vote value given to missing messages. For example, missing messages can be given a "null" value. Further, if the agreement is that the null votes are in the majority, a pre-assigned default strategy can be used (e.g. retreat). The typical mapping of this story onto computer systems is that the computers are the generals and their digital communication system links are the messengers. Although the problem is formulated in the analogy as a decision-making and security problem, in electronics, it cannot be solved by

cryptographic Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or '' -logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adve ...

digital signature A digital signature is a mathematical scheme for verifying the authenticity of digital messages or documents. A valid digital signature, where the prerequisites are satisfied, gives a recipient very high confidence that the message was created b ...

s alone, because failures such as incorrect voltages can propagate through the encryption process. Thus, a component may appear functioning to one component and faulty to another, which prevents forming a consensus as to whether the component is faulty or not.

Characteristics

A Byzantine fault is any fault presenting different symptoms to different observers. A Byzantine failure is the loss of a system service due to a Byzantine fault in systems that require consensus among distributed nodes. The objective of Byzantine fault tolerance is to be able to defend against failures of system components with or without symptoms that prevent other components of the system from reaching an agreement among themselves, where such an agreement is needed for the correct operation of the system. The remaining operationally correct components of a Byzantine fault tolerant system will be able to continue providing the system's service as originally intended, assuming there are a sufficient number of accurately-operating components to maintain the service. Byzantine failures are considered the most general and most difficult class of failures among the failure modes. The so-called fail-stop failure mode occupies the simplest end of the spectrum. Whereas fail-stop failure mode simply means that the only way to fail is a

node In general, a node is a localized swelling (a "knot") or a point of intersection (a vertex). Node may refer to: In mathematics * Vertex (graph theory), a vertex in a mathematical graph *Vertex (geometry), a point where two or more curves, lines ...

crash, detected by other nodes, Byzantine failures imply no restrictions, which means that the failed node can generate arbitrary data, including data that makes it appear like a functioning node. Thus, Byzantine failures can confuse failure detection systems, which makes fault tolerance difficult. Despite the analogy, a Byzantine failure is not necessarily a

security Security is protection from, or resilience against, potential harm (or other unwanted coercive change) caused by others, by restraining the freedom of others to act. Beneficiaries (technically referents) of security may be of persons and social ...

problem involving hostile human interference: it can arise purely from electrical or software faults. The terms fault and failure are used here according to the standard definitions originally created by a joint committee on "Fundamental Concepts and Terminology" formed by the

IEEE The Institute of Electrical and Electronics Engineers (IEEE) is a 501(c)(3) professional association for electronic engineering and electrical engineering (and associated disciplines) with its corporate office in New York City and its operation ...

Computer Society's Technical Committee on Dependable Computing and Fault-Tolerance and IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance. See also

dependability In systems engineering, dependability is a measure of a system's availability, reliability, maintainability, and in some cases, other characteristics such as durability, safety and security. In real-time computing, dependability is the ability to ...

Caveat

Byzantine fault tolerance is only concerned with broadcast consistency, that is, the property that when one component broadcasts a single consistent value to other components (i.e., sends the same value to the other components), they all receive exactly the same value, or in the case that the broadcaster is not consistent, the other components agree on a common value. This kind of fault tolerance does not encompass the correctness of the value itself; for example, an adversarial component that deliberately sends an incorrect value, but sends that same value consistently to all components, will not be caught in the Byzantine fault tolerance scheme.

Formal definition

Setting: Given a system of components, of which are dishonest, and assuming only point-to-point channel between all the components. Whenever a component tries to broadcast a value , the other components are allowed to discuss with each other and verify the consistency of 's broadcast, and eventually settle on a common value . Property: The system is said to resist Byzantine faults if a component can broadcast a value , and then: # If is honest, then all honest components agree on the value . # In any case, all honest components agree on the same value . Variants: The problem has been studied in the case of both synchronous and asynchronous communications. The communication graph above is assumed to be the complete graph (i.e. each component can discuss with every other), but the communication graph can be restricted. It can also be relaxed in a more "realistic" problem where the faulty components do not collude together in an attempt to lure the others into error. It is in this setting that practical algorithms have been devised.

History

The problem of obtaining Byzantine consensus was conceived and formalized by Robert Shostak, who dubbed it the ''interactive consistency'' problem. This work was done in 1978 in the context of the NASA-sponsored SIFT project in the Computer Science Lab at

SRI International SRI International (SRI) is an American nonprofit scientific research institute and organization headquartered in Menlo Park, California. The trustees of Stanford University established SRI in 1946 as a center of innovation to support economic ...

. SIFT (for Software Implemented Fault Tolerance) was the brain child of John Wensley, and was based on the idea of using multiple general-purpose computers that would communicate through pairwise messaging in order to reach a consensus, even if some of the computers were faulty. At the beginning of the project, it was not clear how many computers in total were needed to guarantee that a conspiracy of ''n'' faulty computers could not "thwart" the efforts of the correctly-operating ones to reach consensus. Shostak showed that a minimum of 3''n+''1 are needed, and devised a two-round 3''n+1'' messaging protocol that would work for ''n''=1. His colleague Marshall Pease generalized the algorithm for any n > 0, proving that 3''n''+1 is both necessary and sufficient. These results, together with a later proof by

Leslie Lamport Leslie B. Lamport (born February 7, 1941 in Brooklyn) is an American computer scientist and mathematician. Lamport is best known for his seminal work in distributed systems, and as the initial developer of the document preparation system LaTeX an ...

of the sufficiency of 3''n'' using digital signatures, were published in the seminal paper, ''Reaching Agreement in the Presence of Faults.'' The authors were awarded the 2005

Edsger W. Dijkstra Prize The Edsger W. Dijkstra Paper Prize in Distributed Computing is given for outstanding papers on the principles of distributed computing, whose significance and impact on the theory and/or practice of distributed computing has been evident for at lea ...

for this paper. To make the interactive consistency problem easier to understand, Lamport devised a colorful allegory in which a group of army generals formulate a plan for attacking a city. In its original version, the story cast the generals as commanders of the

Albania Albania ( ; sq, Shqipëri or ), or , also or . officially the Republic of Albania ( sq, Republika e Shqipërisë), is a country in Southeastern Europe. It is located on the Adriatic and Ionian Seas within the Mediterranean Sea and shares ...

n army. The name was changed, eventually settling on "

Byzantine The Byzantine Empire, also referred to as the Eastern Roman Empire or Byzantium, was the continuation of the Roman Empire primarily in its eastern provinces during Late Antiquity and the Middle Ages, when its capital city was Constantinopl ...

", at the suggestion of Jack Goldberg to future-proof any potential offense giving. This formulation of the problem, together with some additional results, were presented by the same authors in their 1982 paper, "The Byzantine Generals Problem".

Examples

Several examples of Byzantine failures that have occurred are given in two equivalent journal papers. These and other examples are described on the

NASA The National Aeronautics and Space Administration (NASA ) is an independent agency of the US federal government responsible for the civil space program, aeronautics research, and space research. NASA was established in 1958, succeeding t ...

DASHlink web pages. Byzantine errors were observed infrequently and at irregular points during endurance testing for the newly constructed ''Virginia'' class submarines, at least through 2005 (when the issues were publicly reported).

Implementations

One example of BFT in use is

Bitcoin Bitcoin ( abbreviation: BTC; sign: ₿) is a decentralized digital currency that can be transferred on the peer-to-peer bitcoin network. Bitcoin transactions are verified by network nodes through cryptography and recorded in a public distr ...

, a peer-to-peer digital cash system. The

Bitcoin network The bitcoin network is a peer-to-peer payment network that operates on a cryptographic protocol. Users send and receive bitcoins, the units of currency, by broadcasting digitally-signed messages to the network using bitcoin cryptocurren ...

works in parallel to generate a blockchain with

proof-of-work Proof of work (PoW) is a form of cryptographic proof in which one party (the ''prover'') proves to others (the ''verifiers'') that a certain amount of a specific computational effort has been expended. Verifiers can subsequently confirm this expe ...

allowing the system to overcome Byzantine failures and reach a coherent global view of the system's state. Some

proof of stake Proof-of-stake (PoS) protocols are a class of consensus mechanisms for blockchains that work by selecting validators in proportion to their quantity of holdings in the associated cryptocurrency. This is done to avoid the computational cost of ...

blockchains also use BFT algorithms. Some aircraft systems, such as the Boeing 777 Aircraft Information Management System (via its ARINC 659 SAFEbus network), the Boeing 777 flight control system, and the Boeing 787 flight control systems use Byzantine fault tolerance; because these are real-time systems, their Byzantine fault tolerance solutions must have very low latency. For example, SAFEbus can achieve Byzantine fault tolerance within the order of a microsecond of added latency. The

SpaceX Dragon American private space transportation company SpaceX has developed and produced several spacecraft named Dragon. The first family member, now referred to as Dragon 1, flew 23 cargo missions to the ISS between 2010 and 2020 before being retired ...

considers Byzantine fault tolerance in its design. Byzantine fault tolerance mechanisms use components that repeat an incoming message (or just its signature) to other recipients of that incoming message. All these mechanisms make the assumption that the act of repeating a message blocks the propagation of Byzantine symptoms. For systems that have a high degree of safety or security criticality, these assumptions must be proven to be true to an acceptable level of

fault coverage {{Unreferenced, date=November 2022 Fault coverage refers to the percentage of some type of fault that can be detected during the test of any engineered system. High fault coverage is particularly valuable during manufacturing test, and techniques ...

. When providing proof through testing, one difficulty is creating a sufficiently wide range of signals with Byzantine symptoms. Such testing likely will require specialized fault injectors.

Solutions

Several early solutions were described by Lamport, Shostak, and Pease in 1982. They began by noting that the Generals' Problem can be reduced to solving a "Commander and Lieutenants" problem where loyal Lieutenants must all act in unison and that their action must correspond to what the Commander ordered in the case that the Commander is loyal: * One solution considers scenarios in which messages may be forged, but which will be ''Byzantine-fault-tolerant'' as long as the number of disloyal generals is less than one third of the generals. The impossibility of dealing with one-third or more traitors ultimately reduces to proving that the one Commander and two Lieutenants problem cannot be solved, if the Commander is traitorous. To see this, suppose we have a traitorous Commander A, and two Lieutenants, B and C: when A tells B to attack and C to retreat, and B and C send messages to each other, forwarding A's message, neither B nor C can figure out who is the traitor, since it is not necessarily A—the other Lieutenant could have forged the message purportedly from A. It can be shown that if ''n'' is the number of generals in total, and ''t'' is the number of traitors in that ''n'', then there are solutions to the problem only when ''n'' > 3''t'' and the communication is synchronous (bounded delay). * A second solution requires unforgeable message signatures. For security-critical systems,

s (in modern computer systems, this may be achieved in practice using

public-key cryptography Public-key cryptography, or asymmetric cryptography, is the field of cryptographic systems that use pairs of related keys. Each key pair consists of a public key and a corresponding private key. Key pairs are generated with cryptographic alg ...

) can provide Byzantine fault tolerance in the presence of an arbitrary number of traitorous generals. However, for

safety-critical system A safety-critical system (SCS) or life-critical system is a system whose failure or malfunction may result in one (or more) of the following outcomes: * death or serious injury to people * loss or severe damage to equipment/property * environme ...

s (where "security" addresses intelligent threats while "safety" addresses the inherent dangers of an activity or mission), simple error detecting codes, such as CRCs, provide weaker but often sufficient coverage at a much lower cost. This is true for both Byzantine and non-Byzantine faults. Furthermore, sometimes security measures weaken safety and vice versa. Thus, cryptographic digital signature methods are not a good choice for safety-critical systems, unless there is also a specific security threat as well. While error detecting codes, such as CRCs, are better than cryptographic techniques, neither provide adequate coverage for active electronics in safety-critical systems. This is illustrated by the ''Schrödinger CRC'' scenario where a CRC-protected message with a single Byzantine faulty bit presents different data to different observers and each observer sees a valid CRC. * Also presented is a variation on the first two solutions allowing Byzantine-fault-tolerant behavior in some situations where not all generals can communicate directly with each other. Several system architectures were designed c. 1980 that implemented Byzantine fault tolerance. These include: Draper's FTMP, Honeywell's MMFCS, and SRI's SIFT.

Advanced solutions

In 1999, Miguel Castro and

Barbara Liskov Barbara Liskov (born November 7, 1939 as Barbara Jane Huberman) is an American computer scientist who has made pioneering contributions to programming languages and distributed computing. Her notable work includes the development of the Liskov ...

introduced the "Practical Byzantine Fault Tolerance" (PBFT) algorithm, which provides high-performance Byzantine state machine replication, processing thousands of requests per second with sub-millisecond increases in latency. After PBFT, several BFT protocols were introduced to improve its robustness and performance. For instance, Q/U, HQ, Zyzzyva, and ABsTRACTs, addressed the performance and cost issues; whereas other protocols, like Aardvark and RBFT, addressed its robustness issues. Furthermore, Adapt tried to make use of existing BFT protocols, through switching between them in an adaptive way, to improve system robustness and performance as the underlying conditions change. Furthermore, BFT protocols were introduced that leverage trusted components to reduce the number of replicas, e.g., A2M-PBFT-EA and MinBFT.

References

Sources

External links

Byzantine Fault Tolerance in the RKBExplorer
Public-key cryptography Distributed computing problems Fault-tolerant computer systems Theory of computation