HOME

TheInfoList



OR:

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. Distributed computing is a field of
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
that studies distributed systems. The components of a distributed system interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail. Examples of distributed systems vary from SOA-based systems to
massively multiplayer online game A massively multiplayer online game (MMOG or more commonly MMO) is an online video game with a large number of players, often hundreds or thousands, on the same server. MMOs usually feature a huge, persistent open world, although there are ...
s to peer-to-peer applications. A
computer program A computer program is a sequence or set of instructions in a programming language for a computer to Execution (computing), execute. Computer programs are one component of software, which also includes software documentation, documentation and oth ...
that runs within a distributed system is called a distributed program, and ''distributed programming'' is the process of writing such programs. There are many different types of implementations for the message passing mechanism, including pure HTTP, RPC-like connectors and message queues. ''Distributed computing'' also refers to the use of distributed systems to solve computational problems. In ''distributed computing'', a problem is divided into many tasks, each of which is solved by one or more computers, which communicate with each other via message passing., p. 291–292. , p. 5.


Introduction

The word ''distributed'' in terms such as "distributed system", "distributed programming", and "
distributed algorithm A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors. Distributed algorithms are used in different application areas of distributed computing, such as telecommunications, scientific ...
" originally referred to computer networks where individual computers were physically distributed within some geographical area. The terms are nowadays used in a much wider sense, even referring to autonomous
processes A process is a series or set of activities that interact to produce a result; it may occur once-only or be recurrent or periodic. Things called a process include: Business and management *Business process, activities that produce a specific se ...
that run on the same physical computer and interact with each other by message passing. While there is no single definition of a distributed system,, p. 10. the following defining properties are commonly used as: * There are several autonomous computational entities (''computers'' or '' nodes''), each of which has its own local
memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remember ...
. * The entities communicate with each other by
message passing In computer science, message passing is a technique for invoking behavior (i.e., running a program) on a computer. The invoking program sends a message to a process (which may be an actor or object) and relies on that process and its supporting ...
. A distributed system may have a common goal, such as solving a large computational problem; the user then perceives the collection of autonomous processors as a unit. Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users. Other typical properties of distributed systems include the following: * The system has to tolerate failures in individual computers. * The structure of the system (network topology, network latency, number of computers) is not known in advance, the system may consist of different kinds of computers and network links, and the system may change during the execution of a distributed program. * Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.


Parallel and distributed computing

Distributed systems are groups of networked computers which share a common goal for their work. The terms "
concurrent computing Concurrent computing is a form of computing in which several computations are executed '' concurrently''—during overlapping time periods—instead of ''sequentially—''with one completing before the next starts. This is a property of a syst ...
", "
parallel computing Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different f ...
", and "distributed computing" have much overlap, and no clear distinction exists between them. The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel. Parallel computing may be seen as a particular tightly coupled form of distributed computing, and distributed computing may be seen as a loosely coupled form of parallel computing. Nevertheless, it is possible to roughly classify concurrent systems as "parallel" or "distributed" using the following criteria: * In parallel computing, all processors may have access to a shared memory to exchange information between processors. * In distributed computing, each processor has its own private memory (
distributed memory In computer science, distributed memory refers to a multiprocessor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data are required, the computational task mu ...
). Information is exchanged by passing messages between the processors. The figure on the right illustrates the difference between distributed and parallel systems. Figure (a) is a schematic view of a typical distributed system; the system is represented as a network topology in which each node is a computer and each line connecting the nodes is a communication link. Figure (b) shows the same distributed system in more detail: each computer has its own local memory, and information can be exchanged only by passing messages from one node to another by using the available communication links. Figure (c) shows a parallel system in which each processor has a direct access to a shared memory. The situation is further complicated by the traditional uses of the terms parallel and distributed ''algorithm'' that do not quite match the above definitions of parallel and distributed ''systems'' (see
below Below may refer to: *Earth * Ground (disambiguation) *Soil *Floor * Bottom (disambiguation) *Less than *Temperatures below freezing *Hell or underworld People with the surname *Ernst von Below (1863–1955), German World War I general *Fred Below ...
for more detailed discussion). Nevertheless, as a rule of thumb, high-performance parallel computation in a shared-memory multiprocessor uses parallel algorithms while the coordination of a large-scale distributed system uses distributed algorithms.


History

The use of concurrent processes which communicate through message-passing has its roots in
operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ef ...
architectures studied in the 1960s. The first widespread distributed systems were
local-area networks A local area network (LAN) is a computer network that interconnects computers within a limited area such as a residence, school, laboratory, university campus or office building. By contrast, a wide area network (WAN) not only covers a larger ...
such as
Ethernet Ethernet () is a family of wired computer networking technologies commonly used in local area networks (LAN), metropolitan area networks (MAN) and wide area networks (WAN). It was commercially introduced in 1980 and first standardized in 1 ...
, which was invented in the 1970s.
ARPANET The Advanced Research Projects Agency Network (ARPANET) was the first wide-area packet-switched network with distributed control and one of the first networks to implement the TCP/IP protocol suite. Both technologies became the technical fou ...
, one of the predecessors of the
Internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, p ...
, was introduced in the late 1960s, and ARPANET
e-mail Electronic mail (email or e-mail) is a method of exchanging messages ("mail") between people using electronic devices. Email was thus conceived as the electronic (digital) version of, or counterpart to, mail, at a time when "mail" meant ...
was invented in the early 1970s. E-mail became the most successful application of ARPANET, and it is probably the earliest example of a large-scale
distributed application A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. Distributed computing is a field of computer sci ...
. In addition to ARPANET (and its successor, the global Internet), other early worldwide computer networks included
Usenet Usenet () is a worldwide distributed discussion system available on computers. It was developed from the general-purpose Unix-to-Unix Copy (UUCP) dial-up network architecture. Tom Truscott and Jim Ellis conceived the idea in 1979, and it wa ...
and FidoNet from the 1980s, both of which were used to support distributed discussion systems. The study of distributed computing became its own branch of computer science in the late 1970s and early 1980s. The first conference in the field,
Symposium on Principles of Distributed Computing The Symposium on Principles of Distributed Computing (PODC) is an academic conference in the field of distributed computing organised annually by the Association for Computing Machinery (special interest groups SIGACT and SIGOPS). Scope and re ...
(PODC), dates back to 1982, and its counterpart
International Symposium on Distributed Computing The International Symposium on Distributed Computing (DISC) is an annual academic conference for refereed presentations, whose focus is the theory, design, analysis, implementation, and application of distributed systems and networks. The Symposium ...
(DISC) was first held in Ottawa in 1985 as the International Workshop on Distributed Algorithms on Graphs.


Architectures

Various hardware and software architectures are used for distributed computing. At a lower level, it is necessary to interconnect multiple CPUs with some sort of network, regardless of whether that network is printed onto a circuit board or made up of loosely coupled devices and cables. At a higher level, it is necessary to interconnect
processes A process is a series or set of activities that interact to produce a result; it may occur once-only or be recurrent or periodic. Things called a process include: Business and management *Business process, activities that produce a specific se ...
running on those CPUs with some sort of
communication system A communications system or communication system is a collection of individual telecommunications networks, transmission systems, relay stations, tributary stations, and terminal equipment usually capable of interconnection and interoperat ...
. Distributed programming typically falls into one of several basic architectures: client–server, three-tier, ''n''-tier, or
peer-to-peer Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the network. They are said to form a peer-to-peer ...
; or categories: loose coupling, or
tight coupling In computing and systems design, a loosely coupled system is one # in which components are weakly associated (have breakable relationships) with each other, and thus changes in one component least affect existence or performance of another com ...
. * Client–server: architectures where smart clients contact the server for data then format and display it to the users. Input at the client is committed back to the server when it represents a permanent change. * Three-tier: architectures that move the client intelligence to a middle tier so that stateless clients can be used. This simplifies application deployment. Most web applications are three-tier. * ''n''-tier: architectures that refer typically to web applications which further forward their requests to other enterprise services. This type of application is the one most responsible for the success of
application server An application server is a server that hosts applications or software that delivers a business application through a communication protocol. An application server framework is a service layer model. It includes software components available to a ...
s. *
Peer-to-peer Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the network. They are said to form a peer-to-peer ...
: architectures where there are no special machines that provide a service or manage the network resources.Vigna P, Casey MJ. ''The Age of Cryptocurrency: How Bitcoin and the Blockchain Are Challenging the Global Economic Order'' St. Martin's Press January 27, 2015 Instead all responsibilities are uniformly divided among all machines, known as peers. Peers can serve both as clients and as servers. Examples of this architecture include BitTorrent and the bitcoin network. Another basic aspect of distributed computing architecture is the method of communicating and coordinating work among concurrent processes. Through various message passing protocols, processes may communicate directly with one another, typically in a master/slave relationship. Alternatively, a "database-centric" architecture can enable distributed computing to be done without any form of direct
inter-process communication In computer science, inter-process communication or interprocess communication (IPC) refers specifically to the mechanisms an operating system provides to allow the processes to manage shared data. Typically, applications can use IPC, categoriz ...
, by utilizing a shared
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases ...
. Database-centric architecture in particular provides relational processing analytics in a schematic architecture allowing for live environment relay. This enables distributed computing functions both within and beyond the parameters of a networked database.


Applications

Reasons for using distributed systems and distributed computing may include: * The very nature of an application may ''require'' the use of a communication network that connects several computers: for example, data produced in one physical location and required in another location. * There are many cases in which the use of a single computer would be possible in principle, but the use of a distributed system is ''beneficial'' for practical reasons. For example: ** It can allow for much larger storage and memory, faster compute, and higher bandwidth than a single machine. ** It can provide more reliability than a non-distributed system, as there is no
single point of failure A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software ap ...
. Moreover, a distributed system may be easier to expand and manage than a monolithic uniprocessor system. ** It may be more cost-efficient to obtain the desired level of performance by using a
cluster may refer to: Science and technology Astronomy * Cluster (spacecraft), constellation of four European Space Agency spacecraft * Asteroid cluster, a small asteroid family * Cluster II (spacecraft), a European Space Agency mission to study th ...
of several low-end computers, in comparison with a single high-end computer.


Examples

Examples of distributed systems and applications of distributed computing include the following: *
telecommunication Telecommunication is the transmission of information by various types of technologies over wire, radio, optical, or other electromagnetic systems. It has its origin in the desire of humans for communication over a distance greater than that ...
networks: **
telephone network A telephone network is a telecommunications network that connects telephones, which allows telephone calls between two or more parties, as well as newer features such as fax and internet. The idea was revolutionized in the 1920s, as more and mor ...
s and
cellular network A cellular network or mobile network is a communication network where the link to and from end nodes is wireless. The network is distributed over land areas called "cells", each served by at least one fixed-location transceiver (typically th ...
s, **
computer network A computer network is a set of computers sharing resources located on or provided by network nodes. The computers use common communication protocols over digital interconnections to communicate with each other. These interconnections are ...
s such as the
Internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, p ...
, ** wireless sensor networks, **
routing algorithm Routing is the process of selecting a path for traffic in a network or between or across multiple networks. Broadly, routing is performed in many types of networks, including circuit-switched networks, such as the public switched telephone netw ...
s; * network applications: **
World Wide Web The World Wide Web (WWW), commonly known as the Web, is an information system enabling documents and other web resources to be accessed over the Internet. Documents and downloadable media are made available to the network through web ...
and peer-to-peer networks, **
massively multiplayer online game A massively multiplayer online game (MMOG or more commonly MMO) is an online video game with a large number of players, often hundreds or thousands, on the same server. MMOs usually feature a huge, persistent open world, although there are ...
s and
virtual reality Virtual reality (VR) is a simulated experience that employs pose tracking and 3D near-eye displays to give the user an immersive feel of a virtual world. Applications of virtual reality include entertainment (particularly video games), edu ...
communities, ** distributed databases and distributed database management systems, **
network file system Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems (Sun) in 1984, allowing a user on a client computer to access files over a computer network much like local storage is accessed. NFS, lik ...
s, ** distributed cache such as
burst buffer In the high-performance computing environment, burst buffer is a fast intermediate storage layer positioned between the front-end computing processes and the back-end storage systems. It bridges the performance gap between the processing speed of ...
s, ** distributed information processing systems such as banking systems and airline reservation systems; * real-time process control: **
aircraft An aircraft is a vehicle that is able to flight, fly by gaining support from the Atmosphere of Earth, air. It counters the force of gravity by using either Buoyancy, static lift or by using the Lift (force), dynamic lift of an airfoil, or in ...
control systems, ** industrial control systems; *
parallel computation Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different fo ...
: **
scientific computing Computational science, also known as scientific computing or scientific computation (SC), is a field in mathematics that uses advanced computing capabilities to understand and solve complex problems. It is an area of science that spans many disc ...
, including cluster computing,
grid computing Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from ...
,
cloud computing Cloud computing is the on-demand availability of computer system resources, especially data storage ( cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over mu ...
, and various
volunteer computing projects Volunteering is a voluntary act of an individual or group freely giving time and labor for community service. Many volunteers are specifically trained in the areas they work, such as medicine, education, or emergency rescue. Others serv ...
, ** distributed rendering in computer graphics. *
peer-to-peer Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the network. They are said to form a peer-to-peer ...


Theoretical foundations


Models

Many tasks that we would like to automate by using a computer are of question–answer type: we would like to ask a question and the computer should produce an answer. In
theoretical computer science computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory. It is difficult to circumscribe the ...
, such tasks are called
computational problem In theoretical computer science, a computational problem is a problem that may be solved by an algorithm. For example, the problem of factoring :"Given a positive integer ''n'', find a nontrivial prime factor of ''n''." is a computational probl ...
s. Formally, a computational problem consists of ''instances'' together with a ''solution'' for each instance. Instances are questions that we can ask, and solutions are desired answers to these questions. Theoretical computer science seeks to understand which computational problems can be solved by using a computer (
computability theory Computability theory, also known as recursion theory, is a branch of mathematical logic, computer science, and the theory of computation that originated in the 1930s with the study of computable functions and Turing degrees. The field has sinc ...
) and how efficiently (
computational complexity theory In theoretical computer science and mathematics, computational complexity theory focuses on classifying computational problems according to their resource usage, and relating these classes to each other. A computational problem is a task solved ...
). Traditionally, it is said that a problem can be solved by using a computer if we can design an
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
that produces a correct solution for any given instance. Such an algorithm can be implemented as a
computer program A computer program is a sequence or set of instructions in a programming language for a computer to Execution (computing), execute. Computer programs are one component of software, which also includes software documentation, documentation and oth ...
that runs on a general-purpose computer: the program reads a problem instance from
input Input may refer to: Computing * Input (computer science), the act of entering data into a computer or data processing system * Information, any data entered into a computer or data processing system * Input device * Input method * Input port (disa ...
, performs some computation, and produces the solution as output. Formalisms such as
random-access machine In computer science, random-access machine (RAM) is an abstract machine in the general class of register machines. The RAM is very similar to the counter machine but with the added capability of 'indirect addressing' of its registers. Like the c ...
s or
universal Turing machine In computer science, a universal Turing machine (UTM) is a Turing machine that can simulate an arbitrary Turing machine on arbitrary input. The universal machine essentially achieves this by reading both the description of the machine to be simu ...
s can be used as abstract models of a sequential general-purpose computer executing such an algorithm. The field of concurrent and distributed computing studies similar questions in the case of either multiple computers, or a computer that executes a network of interacting processes: which computational problems can be solved in such a network and how efficiently? However, it is not at all obvious what is meant by "solving a problem" in the case of a concurrent or distributed system: for example, what is the task of the algorithm designer, and what is the concurrent or distributed equivalent of a sequential general-purpose computer? The discussion below focuses on the case of multiple computers, although many of the issues are the same for concurrent processes running on a single computer. Three viewpoints are commonly used: ; Parallel algorithms in shared-memory model * All processors have access to a shared memory. The algorithm designer chooses the program executed by each processor. * One theoretical model is the parallel random-access machines (PRAM) that are used. However, the classical PRAM model assumes synchronous access to the shared memory. * Shared-memory programs can be extended to distributed systems if the underlying operating system encapsulates the communication between nodes and virtually unifies the memory across all individual systems. * A model that is closer to the behavior of real-world multiprocessor machines and takes into account the use of machine instructions, such as
Compare-and-swap In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of tha ...
(CAS), is that of ''asynchronous shared memory''. There is a wide body of work on this model, a summary of which can be found in the literature. ; Parallel algorithms in message-passing model * The algorithm designer chooses the structure of the network, as well as the program executed by each computer. * Models such as
Boolean circuits In computational complexity theory and circuit complexity, a Boolean circuit is a mathematical model for combinational digital logic circuits. A formal language can be decided by a family of Boolean circuits, one circuit for each possible inpu ...
and sorting networks are used. A Boolean circuit can be seen as a computer network: each gate is a computer that runs an extremely simple computer program. Similarly, a sorting network can be seen as a computer network: each comparator is a computer. ; Distributed algorithms in message-passing model * The algorithm designer only chooses the computer program. All computers run the same program. The system must work correctly regardless of the structure of the network. * A commonly used model is a
graph Graph may refer to: Mathematics *Graph (discrete mathematics), a structure made of vertices and edges **Graph theory, the study of such graphs and their properties *Graph (topology), a topological space resembling a graph in the sense of discre ...
with one
finite-state machine A finite-state machine (FSM) or finite-state automaton (FSA, plural: ''automata''), finite automaton, or simply a state machine, is a mathematical model of computation. It is an abstract machine that can be in exactly one of a finite number o ...
per node. In the case of distributed algorithms, computational problems are typically related to graphs. Often the graph that describes the structure of the computer network ''is'' the problem instance. This is illustrated in the following example.


An example

Consider the computational problem of finding a coloring of a given graph ''G''. Different fields might take the following approaches: ; Centralized algorithms * The graph ''G'' is encoded as a string, and the string is given as input to a computer. The computer program finds a coloring of the graph, encodes the coloring as a string, and outputs the result. ; Parallel algorithms * Again, the graph ''G'' is encoded as a string. However, multiple computers can access the same string in parallel. Each computer might focus on one part of the graph and produce a coloring for that part. * The main focus is on high-performance computation that exploits the processing power of multiple computers in parallel. ; Distributed algorithms * The graph ''G'' is the structure of the computer network. There is one computer for each node of ''G'' and one communication link for each edge of ''G''. Initially, each computer only knows about its immediate neighbors in the graph ''G''; the computers must exchange messages with each other to discover more about the structure of ''G''. Each computer must produce its own color as output. * The main focus is on coordinating the operation of an arbitrary distributed system. While the field of parallel algorithms has a different focus than the field of distributed algorithms, there is much interaction between the two fields. For example, the Cole–Vishkin algorithm for graph coloring was originally presented as a parallel algorithm, but the same technique can also be used directly as a distributed algorithm. Moreover, a parallel algorithm can be implemented either in a parallel system (using shared memory) or in a distributed system (using message passing). The traditional boundary between parallel and distributed algorithms (choose a suitable network vs. run in any given network) does not lie in the same place as the boundary between parallel and distributed systems (shared memory vs. message passing).


Complexity measures

In parallel algorithms, yet another resource in addition to time and space is the number of computers. Indeed, often there is a trade-off between the running time and the number of computers: the problem can be solved faster if there are more computers running in parallel (see speedup). If a decision problem can be solved in
polylogarithmic time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
by using a polynomial number of processors, then the problem is said to be in the class NC. The class NC can be defined equally well by using the PRAM formalism or Boolean circuits—PRAM machines can simulate Boolean circuits efficiently and vice versa. In the analysis of distributed algorithms, more attention is usually paid on communication operations than computational steps. Perhaps the simplest model of distributed computing is a synchronous system where all nodes operate in a lockstep fashion. This model is commonly known as the LOCAL model. During each ''communication round'', all nodes in parallel (1) receive the latest messages from their neighbours, (2) perform arbitrary local computation, and (3) send new messages to their neighbors. In such systems, a central complexity measure is the number of synchronous communication rounds required to complete the task. This complexity measure is closely related to the
diameter In geometry, a diameter of a circle is any straight line segment that passes through the center of the circle and whose endpoints lie on the circle. It can also be defined as the longest chord of the circle. Both definitions are also valid f ...
of the network. Let ''D'' be the diameter of the network. On the one hand, any computable problem can be solved trivially in a synchronous distributed system in approximately 2''D'' communication rounds: simply gather all information in one location (''D'' rounds), solve the problem, and inform each node about the solution (''D'' rounds). On the other hand, if the running time of the algorithm is much smaller than ''D'' communication rounds, then the nodes in the network must produce their output without having the possibility to obtain information about distant parts of the network. In other words, the nodes must make globally consistent decisions based on information that is available in their ''local D-neighbourhood''. Many distributed algorithms are known with the running time much smaller than ''D'' rounds, and understanding which problems can be solved by such algorithms is one of the central research questions of the field. Typically an algorithm which solves a problem in polylogarithmic time in the network size is considered efficient in this model. Another commonly used measure is the total number of bits transmitted in the network (cf. communication complexity). The features of this concept are typically captured with the CONGEST(B) model, which is similarly defined as the LOCAL model, but where single messages can only contain B bits.


Other problems

Traditional computational problems take the perspective that the user asks a question, a computer (or a distributed system) processes the question, then produces an answer and stops. However, there are also problems where the system is required not to stop, including the dining philosophers problem and other similar
mutual exclusion In computer science, mutual exclusion is a property of concurrency control, which is instituted for the purpose of preventing race conditions. It is the requirement that one thread of execution never enters a critical section while a concurren ...
problems. In these problems, the distributed system is supposed to continuously coordinate the use of shared resources so that no conflicts or
deadlock In concurrent computing, deadlock is any situation in which no member of some group of entities can proceed because each waits for another member, including itself, to take action, such as sending a message or, more commonly, releasing a loc ...
s occur. There are also fundamental challenges that are unique to distributed computing, for example those related to ''fault-tolerance''. Examples of related problems include consensus problems,
Byzantine fault tolerance A Byzantine fault (also Byzantine generals problem, interactive consistency, source congruency, error avalanche, Byzantine agreement problem, and Byzantine failure) is a condition of a computer system, particularly distributed computing systems, ...
, and
self-stabilisation Self-stabilization is a concept of fault-tolerance in distributed systems. Given any initial state, a self-stabilizing distributed system will end up in a correct state in a finite number of execution steps. At first glance, the guarantee of s ...
. Much research is also focused on understanding the ''asynchronous'' nature of distributed systems: * Synchronizers can be used to run synchronous algorithms in asynchronous systems. *
Logical clock A logical clock is a mechanism for capturing chronological and causal relationships in a distributed system. Often, distributed systems may have no physically synchronous global clock. In many applications (such as distributed GNU make), if two p ...
s provide a causal happened-before ordering of events. *
Clock synchronization Clock synchronization is a topic in computer science and engineering that aims to coordinate otherwise independent clocks. Even when initially set accurately, real clocks will differ after some amount of time due to clock drift, caused by clocks ...
algorithms provide globally consistent physical time stamps.


Election An election is a formal group decision-making process by which a population chooses an individual or multiple individuals to hold public office. Elections have been the usual mechanism by which modern representative democracy has operat ...

''Coordinator election'' (or ''leader election'') is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task is begun, all network nodes are either unaware which node will serve as the "coordinator" (or leader) of the task, or unable to communicate with the current coordinator. After a coordinator election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task coordinator. The network nodes communicate among themselves in order to decide which of them will get into the "coordinator" state. For that, they need some method in order to break the symmetry among them. For example, if each node has unique and comparable identities, then the nodes can compare their identities, and decide that the node with the highest identity is the coordinator. The definition of this problem is often attributed to LeLann, who formalized it as a method to create a new token in a token
ring network A ring network is a network topology in which each node connects to exactly two other nodes, forming a single continuous pathway for signals through each node – a ring. Data travels from node to node, with each node along the way handling eve ...
in which the token has been lost. Coordinator election algorithms are designed to be economical in terms of total
byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable uni ...
s transmitted, and time. The algorithm suggested by Gallager, Humblet, and Spira for general undirected graphs has had a strong impact on the design of distributed algorithms in general, and won the
Dijkstra Prize The Edsger W. Dijkstra Paper Prize in Distributed Computing is given for outstanding papers on the principles of distributed computing, whose significance and impact on the theory and/or practice of distributed computing has been evident for at le ...
for an influential paper in distributed computing. Many other algorithms were suggested for different kinds of network
graph Graph may refer to: Mathematics *Graph (discrete mathematics), a structure made of vertices and edges **Graph theory, the study of such graphs and their properties *Graph (topology), a topological space resembling a graph in the sense of discre ...
s, such as undirected rings, unidirectional rings, complete graphs, grids, directed Euler graphs, and others. A general method that decouples the issue of the graph family from the design of the coordinator election algorithm was suggested by Korach, Kutten, and Moran. In order to perform coordination, distributed systems employ the concept of coordinators. The coordinator election problem is to choose a process from among a group of processes on different processors in a distributed system to act as the central coordinator. Several central coordinator election algorithms exist.


Properties of distributed systems

So far the focus has been on ''designing'' a distributed system that solves a given problem. A complementary research problem is ''studying'' the properties of a given distributed system. The
halting problem In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running, or continue to run forever. Alan Turing proved in 1936 that a ...
is an analogous example from the field of centralised computation: we are given a computer program and the task is to decide whether it halts or runs forever. The halting problem is undecidable in the general case, and naturally understanding the behaviour of a computer network is at least as hard as understanding the behaviour of one computer. However, there are many interesting special cases that are decidable. In particular, it is possible to reason about the behaviour of a network of finite-state machines. One example is telling whether a given network of interacting (asynchronous and non-deterministic) finite-state machines can reach a deadlock. This problem is
PSPACE-complete In computational complexity theory, a decision problem is PSPACE-complete if it can be solved using an amount of memory that is polynomial in the input length (polynomial space) and if every other problem that can be solved in polynomial space can b ...
,, Section 19.3. i.e., it is decidable, but not likely that there is an efficient (centralised, parallel or distributed) algorithm that solves the problem in the case of large networks.


See also

*
Dataflow programming In computer programming, dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus implementing dataflow principles and architecture. Dataflow programming languages share ...
*
Distributed networking Distributed networking is a distributed computing network system where components of the program and data depend on multiple sources. Overview Distributed networking, used in distributed computing, is the network system over which computer prog ...
*
Decentralized computing Decentralized computing is the allocation of resources, both hardware and software, to each individual workstation, or office location. In contrast, centralized computing exists when the majority of functions are carried out, or obtained from a ...
*
Federation (information technology) A federation is a group of computing or network providers agreeing upon standards of operation in a collective fashion. The term may be used when describing the inter-operation of two distinct, formally disconnected, telecommunications networks ...
*
AppScale AppScale is a software company offering cloud infrastructure software and services to enterprises, government agencies, contractors, and third-party service providers. The company commercially supports one software product, AppScale ATS, a manage ...
* BOINC * Code mobility *
Distributed algorithm A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors. Distributed algorithms are used in different application areas of distributed computing, such as telecommunications, scientific ...
* Distributed algorithmic mechanism design *
Distributed cache In computing, a distributed cache is an extension of the traditional concept of cache used in a single locale. A distributed cache may span multiple servers so that it can grow in size and in transactional capacity. It is mainly used to store app ...
*
Distributed operating system A distributed operating system is system software over a collection of independent software, networked, communicating, and physically separate computational nodes. They handle jobs which are serviced by multiple CPUs. Each individual node holds a ...
* Edsger W. Dijkstra Prize in Distributed Computing * Fog computing *
Folding@home Folding@home (FAH or F@h) is a volunteer computing project aimed to help scientists develop new therapeutics for a variety of diseases by the means of simulating protein dynamics. This includes the process of protein folding and the movements ...
*
Grid computing Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from ...
* Inferno * Jungle computing *
Layered queueing network In queueing theory, a discipline within the mathematical theory of probability, a layered queueing network (or rendezvous network) is a queueing network model where the service time for each job at each service node is given by the response time ...
* Library Oriented Architecture (LOA) * List of distributed computing conferences *
List of volunteer computing projects This is a comprehensive list of volunteer computing projects; a type of distributed computing where volunteers donate computing time to specific causes. The donated computing power comes from idle CPUs and GPUs in personal computers, video gam ...
* List of important publications in concurrent, parallel, and distributed computing *
Model checking In computer science, model checking or property checking is a method for checking whether a finite-state model of a system meets a given specification (also known as correctness). This is typically associated with hardware or software system ...
* Parallel distributed processing *
Parallel programming model In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its ''generality ...
*
Plan 9 from Bell Labs Plan 9 from Bell Labs is a distributed operating system which originated from the Computing Science Research Center (CSRC) at Bell Labs in the mid-1980s and built on UNIX concepts first developed there in the late 1960s. Since 2000, Plan 9 has be ...
* Shared nothing architecture * Eventual programming *
Actor model The actor model in computer science is a mathematical model of concurrent computation that treats ''actor'' as the universal primitive of concurrent computation. In response to a message it receives, an actor can: make local decisions, create mor ...
*
Eventual consistency Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last up ...


Notes


References

; Books * . * . * . * . * . * . * . * . * . * . ; Articles * . * . * . * . ; Web sites * *


Further reading

; Books * . * * . *
Java Distributed Computing by Jim Faber, 1998
* . * * * ; Articles * . * ; Conference Papers *


External links

* * * {{DEFAULTSORT:Distributed Computing Decentralization