The bulk synchronous parallel (BSP) abstract computer is a bridging model for designing parallel algorithms. It is similar to the parallel random access machine (PRAM) model, but unlike PRAM, BSP does not take communication and synchronization for granted. In fact, quantifying the requisite synchronization and communication is an important part of analyzing a BSP algorithm.
History
The BSP model was developed by Leslie Valiant of Harvard University during the 1980s. The definitive article was published in 1990.[Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33, Issue 8, Aug. 1990]
Between 1990 and 1992, Leslie Valiant and Bill McColl of Oxford University worked on ideas for a distributed-memory BSP programming model, in Princeton and at Harvard. Between 1992 and 1997, McColl led a large research team at Oxford that developed various BSP programming libraries, languages and tools, as well as numerous massively parallel BSP algorithms, including many early examples of high-performance communication-avoiding parallel algorithms[W F McColl. Scalable Computing. Computer Science Today: Recent Trends and Developments. J van Leeuwen (editor). LNCS Volume 1000, Springer-Verlag, pp. 46-61 (1995)] and recursive "immortal" parallel algorithms that achieve the best possible performance and optimal parametric tradeoffs.[W F McColl and A Tiskin. Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3), pp. 287-297 (1999)]
With interest and momentum growing, McColl then led a group from Oxford, Harvard, Florida, Princeton, Bell Labs, Columbia and Utrecht that developed and published the BSPlib standard for BSP programming in 1996.[J M D Hill, W F McColl, D C Stefanescu, M W Goudreau, K Lang, S B Rao, T Suel, T Tsantilas and R H Bisseling. BSPlib: The BSP Programming Library. Parallel Computing 24(14), pp. 1947-1980 (1998)]
Valiant developed an extension to the BSP model in the 2000s, leading to the publication of the Multi-BSP model in 2011.[Valiant, L. G. (2011). A bridging model for multi-core computing. ''Journal of Computer and System Sciences'', 77(1), 154-166]
In 2017, McColl developed a major new extension of the BSP model that provides fault tolerance and tail tolerance for large-scale parallel computations in AI, analytics and high-performance computing (HPC).[A Bridging Model for High Performance Cloud Computing by Bill McColl, in 18th SIAM Conference on Parallel Processing for Scientific Computing (2018), http://meetings.siam.org/sess/dsp_talk.cfm?p=88973][Bill McColl. Mathematics, Models and Architectures. Chapter 1, pp. 6-53. Mathematics for Future Computing and Communications, edited by Liao Heng and Bill McColl. Cambridge University Press (2022)]
The BSP model
Overview
A BSP computer consists of the following:
* Components capable of processing and/or local memory transactions (i.e., processors),
* A network that routes messages between pairs of such components, and
* A hardware facility that allows for the synchronization of all or a subset of components.
This is commonly interpreted as a set of processors that may follow different threads of computation, with each processor equipped with fast local memory and interconnected by a communication network.
BSP algorithms rely heavily on the third feature; a computation proceeds in a series of global ''supersteps'', each of which consists of three components:
* Concurrent computation: every participating processor may perform local computations, i.e., each process can only make use of values stored in the local fast memory of the processor. The computations occur asynchronously of each other but may overlap with communication.
* Communication: the processes exchange data to facilitate remote data storage.
* Barrier synchronization: when a process reaches this point (the ''barrier''), it waits until all other processes have reached the same barrier.
The computation and communication actions do not have to be ordered in time. Communication typically takes the form of the one-sided ''PUT'' and ''GET'' remote direct memory access (RDMA) calls rather than paired two-sided ''send'' and ''receive'' message-passing calls.
The barrier synchronization concludes the superstep: it ensures that all one-sided communications are properly concluded. Systems based on two-sided communication include this synchronization cost implicitly for every message sent. The barrier synchronization method relies on the BSP computer's hardware facility. In Valiant's original paper, this facility periodically checks if the end of the current superstep is reached globally. The period of this check is denoted by ''L''.
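The structure of a superstep can be illustrated with a small program. The following is a minimal sketch using the BSPlib C interface mentioned in the History section; each processor computes a value locally, PUTs it into the memory of its right neighbour, and the concluding barrier guarantees that the transferred data is available. The neighbour-exchange pattern itself is chosen purely for illustration.

```c
#include <stdio.h>
#include <bsp.h>

int main(int argc, char **argv) {
    bsp_begin(bsp_nprocs());            /* start the SPMD computation */
    int p = bsp_nprocs();               /* number of processors */
    int s = bsp_pid();                  /* this processor's identity */

    int value = 0;
    bsp_push_reg(&value, sizeof(int));  /* register 'value' for remote access */
    bsp_sync();                         /* registration takes effect at the barrier */

    /* Superstep: concurrent computation ... */
    int local = s * s;
    /* ... one-sided communication: PUT into the right neighbour ... */
    bsp_put((s + 1) % p, &local, &value, 0, sizeof(int));
    /* ... and barrier synchronization, after which all PUTs are complete. */
    bsp_sync();

    printf("Processor %d received %d\n", s, value);
    bsp_pop_reg(&value);
    bsp_end();
    return 0;
}
```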
The BSP model is also well-suited for automatic memory management for distributed-memory computing through over-decomposition of the problem and oversubscription of the processors. The computation is divided into more logical processes than there are physical processors, and processes are randomly assigned to processors. This strategy can be shown statistically to lead to almost perfect load balancing, both of work and communication.
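A minimal sketch of the random placement just described (the function and its interface are hypothetical, used only for illustration):

```c
#include <stdlib.h>

/* Hypothetical sketch: place n logical processes on p physical processors
   uniformly at random. When n is much larger than p, each processor ends
   up with close to n/p processes with high probability, which balances
   both work and communication. */
void random_placement(int n, int p, int owner[]) {
    for (int i = 0; i < n; i++)
        owner[i] = rand() % p;  /* processor that will host logical process i */
}
```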
Communication
In many parallel programming systems, communications are considered at the level of individual actions, such as sending and receiving a message or memory-to-memory transfer. This is difficult to work with since there are many simultaneous communication actions in a parallel program, and their interactions are typically complex. In particular, it is difficult to say much about the time any single communication action will take to complete.
The BSP model considers communication actions ''en masse''. This makes it possible to give an upper bound on the time needed to communicate a set of data. BSP considers all communication actions of a superstep as one unit and assumes all individual messages sent as part of this unit have a fixed size.
The maximum number of incoming or outgoing messages for a superstep is denoted by ''h''. The ability of a communication network to deliver data is captured by a parameter ''g'', defined such that it takes time ''hg'' for a processor to deliver ''h'' messages of size 1.
A message of length ''m'' obviously takes longer to send than a message of size 1. However, the BSP model does not make a distinction between a message of length ''m'' or ''m'' messages of length 1. In either case, the cost is said to be ''mg''.
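As a worked illustration of this cost rule, the helper below (hypothetical, not part of any BSP library) charges the communication phase of a superstep ''hg'' time units, where ''h'' is the largest fan-in or fan-out of any processor:

```c
/* Hypothetical sketch: BSP cost of the communication phase of a superstep.
   in[i] and out[i] count the unit-size messages that processor i receives
   and sends; g is the machine's communication parameter. */
double communication_cost(int p, const int in[], const int out[], double g) {
    int h = 0;
    for (int i = 0; i < p; i++) {
        if (in[i] > h)  h = in[i];   /* largest fan-in ...            */
        if (out[i] > h) h = out[i];  /* ... or fan-out of a processor */
    }
    return h * g;                    /* an h-relation costs h*g       */
}
```

For example, if one processor exchanges 1000 words while every other processor exchanges only 10, the communication phase of the whole superstep is still charged 1000''g'': the busiest processor determines the cost.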
The parameter ''g'' depends on the following:
* The protocols used to interact within the communication network.
* Buffer management by both the processors and the communication network.
* The routing strategy used in the network.
* The BSP runtime system.
In practice, ''g'' is determined empirically for each parallel computer. Note that ''g'' is not the normalized single-word delivery time but the single-word delivery time under continuous traffic conditions.
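For instance, ''g'' might be estimated by timing a large full relation under continuous traffic, as in the hypothetical BSPlib sketch below (the value of N, the neighbour pattern, and the output format are arbitrary choices for this example):

```c
#include <stdio.h>
#include <bsp.h>

#define N 100000  /* words per processor: large, so that N*g dominates l */

/* Hypothetical sketch: estimate g by timing an N-relation in which every
   processor PUTs N words into its right neighbour at the same time. */
void estimate_g(void) {
    int p = bsp_nprocs(), s = bsp_pid();
    static int src[N], dst[N];
    bsp_push_reg(dst, N * sizeof(int));
    bsp_sync();

    double t0 = bsp_time();
    bsp_put((s + 1) % p, src, dst, 0, N * sizeof(int));
    bsp_sync();                       /* the N-relation completes here */
    double elapsed = bsp_time() - t0;

    if (s == 0)
        printf("g is roughly %g seconds per word\n", elapsed / N);
    bsp_pop_reg(dst);
}
```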
Barriers
The one-sided communication of the BSP model requires barrier synchronization.
Barriers are potentially costly but avoid the possibility of deadlock or livelock, since barriers cannot create circular data dependencies. Tools to detect circular dependencies and deal with them are therefore unnecessary. Barriers also permit novel forms of fault tolerance.
The cost of barrier synchronization is influenced by two issues:
* The cost imposed by the variation in the completion time of the participating concurrent computations. Take the example where all but one of the processes have completed their work for this superstep, and are waiting for the last process, which still has a lot of work to complete. The best that an implementation can do is ensure that each process works on roughly the same problem size.
* The cost of reaching a globally consistent state in all of the processors. This depends on the communication network but also on whether there is special-purpose hardware available for synchronizing and on the way in which interrupts are handled by processors.
The cost of a barrier synchronization is denoted by ''l''.
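Putting the parameters of the model together, the cost of a single superstep is commonly accounted for as ''w'' + ''hg'' + ''l'', where ''w'' is the longest local computation performed by any processor in that superstep. A sketch (the function is hypothetical):

```c
/* Hypothetical sketch: total BSP cost of one superstep.
   w: maximum local computation time on any processor;
   h: maximum fan-in or fan-out of any processor;
   g: communication parameter; l: barrier synchronization cost. */
double superstep_cost(double w, int h, double g, double l) {
    return w + h * g + l;  /* compute, then communicate, then synchronize */
}
```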