In parallel computing, work stealing is a scheduling strategy for multithreaded computer programs. It solves the problem of executing a ''dynamically multithreaded'' computation, one that can "spawn" new threads of execution, on a ''statically multithreaded'' computer, with a fixed number of processors (or cores). It does so efficiently in terms of execution time, memory usage, and inter-processor communication.

In a work stealing scheduler, each processor in a computer system has a queue of work items (computational tasks, threads) to perform. Each work item consists of a series of instructions, to be executed sequentially, but in the course of its execution, a work item may also ''spawn'' new work items that can feasibly be executed in parallel with its other work. These new items are initially put on the queue of the processor executing the work item. When a processor runs out of work, it looks at the queues of the other processors and "steals" their work items. In effect, work stealing distributes the scheduling work over idle processors, and as long as all processors have work to do, no scheduling overhead occurs.

Work stealing contrasts with ''work sharing'', another popular scheduling approach for dynamic multithreading, where each work item is scheduled onto a processor when it is spawned. Compared to this approach, work stealing reduces the amount of process migration between processors, because no such migration occurs when all processors have work to do.

The idea of work stealing goes back to the implementation of the Multilisp programming language and work on parallel functional programming languages in the 1980s. It is employed in the scheduler for the Cilk programming language, the Java fork/join framework, the .NET Task Parallel Library, and the Rust Tokio runtime.


Execution model

Work stealing is designed for a "strict" fork–join model of parallel computation, which means that a computation can be viewed as a directed acyclic graph with a single source (start of computation) and a single sink (end of computation). Each node in this graph represents either a ''fork'' or a ''join''. Forks produce multiple ''logically parallel'' computations, variously called "threads" or "strands". Edges represent serial computation. (In the original presentation, serial computations were represented as nodes as well, and a directed edge represented the relation "is followed by".)

As an example, consider the following trivial fork–join program in Cilk-like syntax:

 function f(a, b):
     c ← fork g(a)
     d ← h(b)
     join
     return c + d

 function g(a):
     return a × 2

 function h(a):
     b ← fork g(a)
     c ← a + 1
     join
     return b + c

A call to f gives rise to a computation graph in which, when two edges leave a node, the computations represented by the edge labels are logically parallel: they may be performed either in parallel, or sequentially. The computation may only proceed past a ''join'' node when the computations represented by its incoming edges are complete. The work of a scheduler, now, is to assign the computations (edges) to processors in a way that makes the entire computation run to completion in the correct order (as constrained by the join nodes), preferably as fast as possible.
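For comparison, the same computation can be expressed with the Java fork/join framework mentioned above, whose ForkJoinPool is a work stealing scheduler. The following is only a minimal sketch; the class names F, G, H and Demo are illustrative, and note that the Java framework uses child stealing (discussed below) rather than the continuation stealing used by Cilk.

 import java.util.concurrent.ForkJoinPool;
 import java.util.concurrent.RecursiveTask;

 // f(a, b) = g(a) + h(b), with g(a) = 2a and h(a) = g(a) + (a + 1),
 // scheduled by the work stealing ForkJoinPool of the Java class library.
 class F extends RecursiveTask<Integer> {
     private final int a, b;
     F(int a, int b) { this.a = a; this.b = b; }
     @Override protected Integer compute() {
         G g = new G(a);
         g.fork();                        // spawn: push g onto this worker's deque
         int d = new H(b).compute();      // run h in the current thread
         return g.join() + d;             // join: wait for (or run/steal back) g
     }
 }

 class G extends RecursiveTask<Integer> {
     private final int a;
     G(int a) { this.a = a; }
     @Override protected Integer compute() { return a * 2; }
 }

 class H extends RecursiveTask<Integer> {
     private final int a;
     H(int a) { this.a = a; }
     @Override protected Integer compute() {
         G g = new G(a);
         g.fork();
         int c = a + 1;
         return g.join() + c;
     }
 }

 class Demo {
     public static void main(String[] args) {
         // f(1, 2) = g(1) + h(2) = 2 + 7 = 9
         System.out.println(ForkJoinPool.commonPool().invoke(new F(1, 2)));
     }
 }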


Algorithm

The randomized version of the work stealing algorithm presented by Blumofe and Leiserson maintains several threads of execution and schedules these onto P processors. Each of the processors has a double-ended queue (deque) of threads. Call the ends of the deque "top" and "bottom".

Each processor that has a current thread to execute executes the instructions in the thread one by one, until it encounters an instruction that causes one of four "special" behaviors:

* A ''spawn'' instruction causes a new thread to be created. The current thread is placed at the bottom of the deque, and the processor starts executing the new thread.
* A ''stalling'' instruction is one that temporarily halts execution of its thread. The processor pops a thread off the bottom of its deque and starts executing that thread. If its deque is empty, it starts work stealing, explained below.
* An instruction may cause a thread to ''die''. The behavior in this case is the same as for an instruction that stalls.
* An instruction may ''enable'' another thread. The other thread is pushed onto the bottom of the deque, but the processor continues execution of its current thread.

Initially, a computation consists of a single thread and is assigned to some processor, while the other processors start off idle. Any processor that becomes idle starts the actual process of work stealing, which means the following:

* it picks another processor uniformly at random;
* if the other processor's deque is non-empty, it pops the top-most thread off the deque and starts executing that;
* else, repeat.
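A minimal sketch of the per-worker loop just described follows, assuming a shared-memory pool of workers; class and method names are illustrative, and the sketch ignores stalling, thread death, enabling, and termination. Note also that, unlike the Blumofe–Leiserson rule for ''spawn'', this sketch lets the running task continue and queues the newly spawned task instead (the "child stealing" variant discussed in the next section).

 import java.util.concurrent.ConcurrentLinkedDeque;
 import java.util.concurrent.ThreadLocalRandom;

 // Each worker owns a deque of tasks: it pushes and pops at the bottom,
 // and when idle it steals from the top of a randomly chosen victim's deque.
 class Worker implements Runnable {
     final ConcurrentLinkedDeque<Runnable> deque = new ConcurrentLinkedDeque<>();
     final Worker[] all;   // all workers in the pool, including this one

     Worker(Worker[] all) { this.all = all; }

     public void run() {
         while (true) {
             Runnable task = deque.pollLast();          // own work: pop from the bottom
             if (task == null) {
                 Worker victim = all[ThreadLocalRandom.current().nextInt(all.length)];
                 task = victim.deque.pollFirst();       // steal from the victim's top
             }
             if (task != null) {
                 task.run();                            // may call spawn() to add new work
             }
         }
     }

     // Called by a running task to create logically parallel work.
     void spawn(Runnable t) { deque.addLast(t); }
 }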


Child stealing vs. continuation stealing

Note that, in the rule for ''spawn'', Blumofe and Leiserson suggest that the "parent" thread execute its new thread, as if performing a function call: in the example above, the processor that encounters fork g(a) begins executing g itself, just as it would for an ordinary call to g. This is called "continuation stealing", because the continuation of the function can be stolen while the spawned thread is executed, and it is the scheduling algorithm used in Cilk Plus. It is not the only way to implement work stealing; the alternative strategy is called "child stealing" and is easier to implement as a library, without compiler support. Child stealing is used by Threading Building Blocks, Microsoft's Task Parallel Library and OpenMP, although the latter gives the programmer control over which strategy is used.
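The difference between the two strategies is easiest to see at the spawn point of the example above. The following comments are only an informal sketch of what each strategy does there, not the internals of any particular runtime:

 // Inside f(a, b), at the point "c ← fork g(a)":
 //
 // Child stealing (Threading Building Blocks, Task Parallel Library, fork/join):
 //   - the new task g(a) is pushed onto the worker's deque;
 //   - the worker keeps executing f, moving on to "d ← h(b)";
 //   - an idle processor may steal g(a) and run it in parallel.
 //
 // Continuation stealing (Cilk, Cilk Plus):
 //   - the worker starts executing g(a) at once, as for an ordinary call;
 //   - the continuation of f (everything after the fork) becomes stealable;
 //   - an idle processor may steal that continuation and run "d ← h(b)".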


Efficiency

Several variants of work stealing have been proposed. The randomized variant due to Blumofe and Leiserson executes a parallel computation in expected time T_1/P + O(T_\infty) on P processors; here, T_1 is the ''work'', or the amount of time required to run the computation on a serial computer, and T_\infty is the ''span'', the amount of time required on an infinitely parallel machine. (See analysis of parallel algorithms for definitions.) This means that, in expectation, the time required is at most a constant factor times the theoretical minimum. However, the running time (in particular, the number of steals executed) can be exponential in T_\infty in the worst case. A localized variant, in which a processor attempts to steal back its own work whenever it is free, has also been analyzed theoretically and practically.
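As an illustration with hypothetical numbers: a computation with work T_1 = 10^9 and span T_\infty = 10^4 has parallelism T_1/T_\infty = 10^5, so on P = 32 processors the bound T_1/P + O(T_\infty) is roughly 3.1 \times 10^7 plus a lower-order term, i.e. close to the perfect speedup of P; the bound only degrades once P approaches the parallelism T_1/T_\infty.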


Space usage

A computation scheduled by the Blumofe–Leiserson version of work stealing uses O(S_1 P) stack space, where S_1 is the stack usage of the same computation on a single processor, fitting the authors' own earlier definition of space efficiency. This bound requires continuation stealing; in a child stealing scheduler, it does not hold, as can be seen from the following example:

 for i = 0 to n:
     fork f(i)
 join

In a child-stealing implementation, all "forked" calls to f are put in a work queue that thus grows to size n, which can be made arbitrarily large.
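The same pattern can be written against a child stealing library; the following sketch uses the Java fork/join framework, with illustrative class names, to show how the queue of pending tasks grows with n:

 import java.util.ArrayList;
 import java.util.List;
 import java.util.concurrent.ForkJoinPool;
 import java.util.concurrent.ForkJoinTask;
 import java.util.concurrent.RecursiveAction;

 // With child stealing, every forked task is queued before any of them is
 // joined, so the worker's deque can grow to hold on the order of n tasks.
 class ForkAll extends RecursiveAction {
     private final int n;
     ForkAll(int n) { this.n = n; }

     @Override protected void compute() {
         List<ForkJoinTask<Void>> tasks = new ArrayList<>();
         for (int i = 0; i < n; i++) {
             tasks.add(new FTask(i).fork());   // queue grows with each iteration
         }
         for (ForkJoinTask<Void> t : tasks) {
             t.join();                         // the final "join" of the loop
         }
     }
 }

 class FTask extends RecursiveAction {
     private final int i;
     FTask(int i) { this.i = i; }
     @Override protected void compute() { /* body of f(i) */ }
 }

 class SpaceDemo {
     public static void main(String[] args) {
         ForkJoinPool.commonPool().invoke(new ForkAll(1_000_000));
     }
 }

Under continuation stealing, by contrast, the worker executes each f(i) immediately and only the loop's continuation is exposed for stealing, which is the intuition behind the O(S_1 P) bound above.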


Multiprogramming variant

The work stealing algorithm as outlined earlier, and its analysis, assume a computing environment where a computation is scheduled onto a set of dedicated processors. In a multiprogramming (multi-tasking) environment, the algorithm must be modified to instead schedule computation tasks onto a pool of worker threads, which in turn are scheduled onto the actual processors by an operating system scheduler. At any given time, the OS scheduler will assign to the work stealing process some number of the processors in the computer, because other processes may be using the remaining processors. In this setting, work stealing with a pool of worker threads has the problem that workers acting as thieves may cause livelock
: they may block the execution of workers that would actually spawn useful tasks. A variant of work stealing has been devised for this situation, which executes a computation in expected time O\left(\frac{T_1}{P_{\rm avg}} + \frac{T_\infty P}{P_{\rm avg}}\right), where P_{\rm avg} is the average number of processors allocated to the computation by the OS scheduler over the computation's running time. The multiprogramming work-scheduler differs from the traditional version in two respects:

* Its queues are non-blocking. While on dedicated processors, access to the queues can be synchronized using locks, this is not advisable in a multiprogramming environment, since the operating system might preempt the worker thread holding the lock, blocking the progress of any other workers that try to access the same queue.
* Before each attempt to steal work, a worker thread calls a ''yield'' system call that yields the processor on which it is scheduled to the OS, in order to prevent starvation (see the sketch after this list).

Attempts to improve on the multiprogramming work stealer have focused on cache locality issues and improved queue data structures.
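A minimal sketch of such a yielding steal loop is given below; it is only illustrative (names invented, termination and task spawning omitted), but it shows the two modifications: lock-free deques and a yield before every steal attempt.

 import java.util.concurrent.ConcurrentLinkedDeque;
 import java.util.concurrent.ThreadLocalRandom;

 // Multiprogramming variant: the deques are non-blocking (no lock that a
 // preempted worker could hold), and a thief yields its processor back to the
 // OS scheduler before each steal attempt, so that workers holding useful
 // tasks are not starved by idle thieves.
 class MultiprogWorker implements Runnable {
     final ConcurrentLinkedDeque<Runnable> deque = new ConcurrentLinkedDeque<>();
     final MultiprogWorker[] all;

     MultiprogWorker(MultiprogWorker[] all) { this.all = all; }

     public void run() {
         while (true) {
             Runnable task = deque.pollLast();       // own work first (lock-free pop)
             if (task == null) {
                 Thread.yield();                     // yield before trying to steal
                 MultiprogWorker victim =
                         all[ThreadLocalRandom.current().nextInt(all.length)];
                 task = victim.deque.pollFirst();    // non-blocking steal from the top
             }
             if (task != null) task.run();
         }
     }
 }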


Alternatives

Several scheduling algorithms for dynamically multithreaded computations compete with work stealing. Besides the traditional work sharing approach, there is a scheduler called ''parallel depth-first'' (PDF) that improves on the space bounds of work stealing, as well as giving better performance in some situations where the cores of a chip multiprocessor share a cache.

