Prefix Sum

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x_0, x_1, x_2, \dots is a second sequence of numbers y_0, y_1, y_2, \dots, the sums of prefixes (running totals) of the input sequence:

:y_0 = x_0
:y_1 = x_0 + x_1
:y_2 = x_0 + x_1 + x_2
:\dots

For instance, the prefix sums of the natural numbers 1, 2, 3, 4, 5, 6, ... are the triangular numbers 1, 3, 6, 10, 15, 21, ....

Prefix sums are trivial to compute in sequential models of computation, by using the formula y_i = y_{i-1} + x_i to compute each output value in sequence order. However, despite their ease of computation, prefix sums are a useful primitive in certain algorithms such as counting sort, and they form the basis of the scan higher-order function in functional programming languages. Prefix sums have also been much studied in parallel algorithms, both as a test problem to be solved and as a useful primitive to be used as a subroutine in other parallel algorithms. Abstractly, a prefix sum requires only a binary associative operator ⊕, making it useful for many applications from calculating well-separated pair decompositions of points to string processing.

Mathematically, the operation of taking prefix sums can be generalized from finite to infinite sequences; in that context, a prefix sum is known as a partial sum of a series. Prefix summation or partial summation form linear operators on the vector spaces of finite or infinite sequences; their inverses are finite difference operators.
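
For illustration, the sequential computation described above can be written in a few lines of Python (a minimal sketch; the function name prefix_sum is arbitrary):

 def prefix_sum(xs):
     """Return the inclusive prefix sums of xs using y_i = y_{i-1} + x_i."""
     ys = []
     total = 0
     for x in xs:
         total += x              # running total so far
         ys.append(total)
     return ys

 print(prefix_sum([1, 2, 3, 4, 5, 6]))   # [1, 3, 6, 10, 15, 21]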


Scan higher order function

In functional programming terms, the prefix sum may be generalized to any binary operation (not just the addition operation); the higher-order function resulting from this generalization is called a scan, and it is closely related to the fold operation. Both the scan and the fold operations apply the given binary operation to the same sequence of values, but differ in that the scan returns the whole sequence of results from the binary operation, whereas the fold returns only the final result. For instance, the sequence of factorial numbers may be generated by a scan of the natural numbers using multiplication instead of addition:

:input numbers: 1, 2, 3, 4, 5, 6, ...
:prefix products: 1, 2, 6, 24, 120, 720, ...
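
As a concrete Python illustration, the standard library already exposes both operations: itertools.accumulate acts as a scan and functools.reduce as a fold (a minimal sketch, not tied to any particular functional language):

 from itertools import accumulate   # scan: keeps every intermediate result
 from functools import reduce       # fold: keeps only the final result
 from operator import mul

 numbers = [1, 2, 3, 4, 5, 6]
 print(list(accumulate(numbers, mul)))   # scan  -> [1, 2, 6, 24, 120, 720]
 print(reduce(mul, numbers))             # fold  -> 720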


Inclusive and exclusive scans

Programming language and library implementations of scan may be either ''inclusive'' or ''exclusive''. An inclusive scan includes input x_i when computing output y_i (i.e., y_i = \bigoplus_{j=0}^{i} x_j) while an exclusive scan does not (i.e., y_i = \bigoplus_{j=0}^{i-1} x_j). In the latter case, implementations either leave y_0 undefined or accept a separate initial value with which to seed the scan. Either type of scan can be transformed into the other: an inclusive scan can be transformed into an exclusive scan by shifting the array produced by the scan right by one element and inserting the identity value at the left of the array. Conversely, an exclusive scan can be transformed into an inclusive scan by shifting the array produced by the scan left and inserting the sum of the last element of the scan and the last element of the input array at the right of the array. Both variants appear in programming languages and libraries; for example, the C++ standard library provides std::inclusive_scan and std::exclusive_scan, and MPI provides MPI_Scan and MPI_Exscan.
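
A minimal Python sketch of the two variants, assuming 0 as the identity value for addition (the helper names are illustrative):

 from operator import add

 def inclusive_scan(xs, op=add):
     # y_i = x_0 ⊕ ... ⊕ x_i
     ys, acc = [], None
     for x in xs:
         acc = x if acc is None else op(acc, x)
         ys.append(acc)
     return ys

 def exclusive_scan(xs, identity=0, op=add):
     # y_i = identity ⊕ x_0 ⊕ ... ⊕ x_{i-1}
     ys, acc = [], identity
     for x in xs:
         ys.append(acc)
         acc = op(acc, x)
     return ys

 print(inclusive_scan([1, 2, 3, 4]))   # [1, 3, 6, 10]
 print(exclusive_scan([1, 2, 3, 4]))   # [0, 1, 3, 6]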


Parallel algorithms

There are two key algorithms for computing a prefix sum in parallel. The first offers a shorter span and more parallelism but is not work-efficient. The second is work-efficient but requires double the span and offers less parallelism. These are presented in turn below.


Algorithm 1: Shorter span, more parallel

Hillis and Steele present the following parallel prefix sum algorithm:

:for i \gets 0 to \lceil\log_2 n\rceil - 1 do
::for j \gets 0 to n - 1 do in parallel
:::if j < 2^i then
::::x^{i+1}_j \gets x^i_j
:::else
::::x^{i+1}_j \gets x^i_j + x^i_{j-2^i}

In the above, the notation x^i_j means the value of the jth element of array x in timestep i. With a single processor this algorithm would run in O(n \log n) time. However, if the machine has at least n processors to perform the inner loop in parallel, the algorithm as a whole runs in O(\log n) time, the number of iterations of the outer loop.
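
The data-parallel rounds can be simulated sequentially in Python; each outer iteration reads a snapshot of the previous timestep, mimicking the lock-step parallel update (an illustrative sketch, not the authors' code):

 import math

 def hillis_steele_scan(xs):
     # Simulate the data-parallel rounds sequentially.
     x = list(xs)
     n = len(x)
     if n <= 1:
         return x
     for i in range(math.ceil(math.log2(n))):
         prev = x[:]                       # snapshot of timestep i
         for j in range(n):                # "in parallel" over j
             if j >= 2 ** i:
                 x[j] = prev[j] + prev[j - 2 ** i]
     return x

 print(hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8]))
 # [1, 3, 6, 10, 15, 21, 28, 36]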


Algorithm 2: Work-efficient

A work-efficient parallel prefix sum can be computed by the following steps.

#Compute the sums of consecutive pairs of items in which the first item of the pair has an even index: z_0 = x_0 + x_1, z_1 = x_2 + x_3, etc.
#Recursively compute the prefix sum w_0, w_1, w_2, \dots of the sequence z_0, z_1, z_2, \dots
#Express each term of the final sequence y_0, y_1, y_2, \dots as the sum of up to two terms of these intermediate sequences: y_0 = x_0, y_1 = w_0, y_2 = w_0 + x_2, y_3 = w_1, etc.

After the first value, each successive number y_i is either copied from a position half as far through the w sequence, or is the previous value added to one value in the x sequence. If the input sequence has n steps, then the recursion continues to a depth of O(\log n), which is also the bound on the parallel running time of this algorithm. The number of steps of the algorithm is O(n), and it can be implemented on a parallel random access machine with O(n/\log n) processors without any asymptotic slowdown by assigning multiple indices to each processor in rounds of the algorithm for which there are more elements than processors.
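
A recursive Python sketch of these three steps (illustrative only; a real parallel implementation would perform steps 1 and 3 in parallel over the array):

 def work_efficient_scan(xs):
     # Recursive work-efficient inclusive prefix sum sketch.
     n = len(xs)
     if n == 0:
         return []
     if n == 1:
         return [xs[0]]
     # Step 1: pairwise sums z_k = x_{2k} + x_{2k+1}
     z = [xs[2 * k] + xs[2 * k + 1] for k in range(n // 2)]
     # Step 2: recursively scan the pairwise sums
     w = work_efficient_scan(z)
     # Step 3: interleave to obtain the full prefix sums
     ys = [0] * n
     ys[0] = xs[0]
     for i in range(1, n):
         if i % 2 == 1:
             ys[i] = w[i // 2]                 # odd positions come directly from w
         else:
             ys[i] = w[i // 2 - 1] + xs[i]     # even positions add one x value
     return ys

 print(work_efficient_scan([1, 2, 3, 4, 5, 6, 7]))   # [1, 3, 6, 10, 15, 21, 28]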


Discussion

Each of the preceding algorithms runs in O(\log n) time. However, the former takes exactly \log_2 n steps, while the latter requires 2\log_2 n - 2 steps. For 16-input examples, Algorithm 1 is 12-way parallel (49 units of work divided by a span of 4) while Algorithm 2 is only 4-way parallel (26 units of work divided by a span of 6). However, Algorithm 2 is work-efficient: it performs only a constant factor (2) of the amount of work required by the sequential algorithm. Algorithm 1, in contrast, is work-inefficient: it performs asymptotically more work (a logarithmic factor) than is required sequentially. Consequently, Algorithm 1 is likely to perform better when abundant parallelism is available, but Algorithm 2 is likely to perform better when parallelism is more limited. Parallel algorithms for prefix sums can often be generalized to other scan operations on associative binary operations, and they can also be computed efficiently on modern parallel hardware such as a GPU. The idea of building in hardware a functional unit dedicated to computing multi-parameter prefix sums was patented by Uzi Vishkin.

Many parallel implementations follow a two-pass procedure where partial prefix sums are calculated in the first pass on each processing unit; the prefix sum of these partial sums is then calculated and broadcast back to the processing units for a second pass using the now known prefix as the initial value. Asymptotically this method takes approximately two read operations and one write operation per item.


Concrete implementations of prefix sum algorithms

An implementation of a parallel prefix sum algorithm, like other parallel algorithms, has to take the parallelization architecture of the platform into account. More specifically, multiple algorithms exist which are adapted for platforms working on shared memory as well as algorithms which are well suited for platforms using distributed memory, relying on message passing as the only form of interprocess communication.


Shared memory: Two-level algorithm

The following algorithm assumes a shared memory machine model; all processing elements (PEs) have access to the same memory. A version of this algorithm is implemented in the Multi-Core Standard Template Library (MCSTL), a parallel implementation of the C++ standard template library which provides adapted versions for parallel computing of various algorithms.

In order to concurrently calculate the prefix sum over n data elements with p processing elements, the data is divided into p+1 blocks, each containing \frac{n}{p+1} elements (for simplicity we assume that p+1 divides n). Note that although the algorithm divides the data into p+1 blocks, only p processing elements run in parallel at a time.

In a first sweep, each PE calculates a local prefix sum for its block. The last block does not need to be calculated, since these prefix sums are only needed as offsets to the prefix sums of the succeeding blocks, and the last block is by definition not succeeded. The p offsets which are stored in the last position of each block are accumulated in a prefix sum of their own and stored in their succeeding positions. For p being a small number, it is faster to do this sequentially; for a large p, this step could be done in parallel as well.

A second sweep is performed. This time the first block does not have to be processed, since it does not need to account for the offset of a preceding block. In this sweep the last block is included instead, and the prefix sums for each block are calculated taking the prefix sum block offsets calculated in the previous sweep into account.

Improvement: If the number of blocks is so large that the sequential step becomes time-consuming on a single processor, the Hillis and Steele algorithm can be used to accelerate the second phase.
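
The two sweeps can be sketched in Python as follows; the thread pool merely stands in for the p processing elements, and, as in the description above, p+1 is assumed to divide n (names and structure are illustrative, not the MCSTL implementation):

 from concurrent.futures import ThreadPoolExecutor

 def two_level_prefix_sum(data, p):
     # Illustrative simulation: split into p+1 blocks, use p worker threads.
     # As in the text, p+1 is assumed to divide len(data) evenly.
     n = len(data)
     size = n // (p + 1)
     blocks = [data[i * size:(i + 1) * size] for i in range(p + 1)]

     def local_scan(block, offset=0):
         out, acc = [], offset
         for v in block:
             acc += v
             out.append(acc)
         return out

     with ThreadPoolExecutor(max_workers=p) as pool:
         # First sweep: local prefix sums of the first p blocks (last block skipped).
         local = list(pool.map(local_scan, blocks[:p]))
         # Sequential step: prefix-sum the block totals to obtain per-block offsets.
         offsets = local_scan([blk[-1] for blk in local])
         # Second sweep: blocks 1..p are re-scanned with their offsets (first block skipped).
         adjusted = list(pool.map(local_scan, blocks[1:], offsets))
     return local[0] + [v for blk in adjusted for v in blk]

 print(two_level_prefix_sum(list(range(1, 13)), p=3))
 # [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78]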


Distributed memory: Hypercube algorithm

The Hypercube Prefix Sum Algorithm is well adapted for distributed memory platforms and works with the exchange of messages between the processing elements. It assumes that p = 2^d processor elements (PEs) participate in the algorithm, equal to the number of corners in a d-dimensional hypercube.

Throughout the algorithm, each PE is seen as a corner in a hypothetical hypercube with knowledge of the total prefix sum \sigma as well as the prefix sum x of all elements up to itself (according to the ordered indices among the PEs), both within its own hypercube.

* The algorithm starts by assuming every PE is the single corner of a zero-dimensional hypercube, and therefore \sigma and x are equal to the local prefix sum of its own elements.
* The algorithm goes on by unifying hypercubes which are adjacent along one dimension. During each unification, \sigma is exchanged and aggregated between the two hypercubes, which keeps the invariant that all PEs at corners of this new hypercube store the total prefix sum of this newly unified hypercube in their variable \sigma. However, only the hypercube containing the PEs with the higher indices also adds this \sigma to its local variable x, keeping the invariant that x only stores the value of the prefix sum of all elements at PEs with indices smaller than or equal to their own index.

In a d-dimensional hypercube with 2^d PEs at the corners, the algorithm has to be repeated d times to have the 2^d zero-dimensional hypercubes be unified into one d-dimensional hypercube. Assuming a duplex communication model where the \sigma of two adjacent PEs in different hypercubes can be exchanged in both directions in one communication step, this means d = \log_2 p communication startups.

 i := Index of own processor element (PE)
 m := prefix sum of local elements of this PE
 d := number of dimensions of the hyper cube

 x = m;     // Invariant: The prefix sum up to this PE in the current sub cube
 σ = m;     // Invariant: The prefix sum of all elements in the current sub cube

 for (k = 0; k <= d-1; k++) {
     y = σ @ PE(i XOR 2^k)   // Exchange σ with the partner PE along dimension k
     σ = σ + y               // Aggregate the total of the newly unified sub cube
     if (i AND 2^k)          // Only the higher-index half also updates its prefix sum
         x = x + y
 }
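
A Python simulation of the d rounds, with the message exchange replaced by array indexing (PE i's partner along dimension k is i XOR 2^k; the code is illustrative only):

 def hypercube_prefix_sum(local_sums, d):
     # Simulate the hypercube algorithm on p = 2**d PEs, each holding one local sum.
     p = 2 ** d
     assert len(local_sums) == p
     x = list(local_sums)       # prefix sum up to this PE within its current sub cube
     sigma = list(local_sums)   # total sum of the current sub cube
     for k in range(d):
         # All pairs of PEs that differ in bit k exchange their sigma values.
         new_sigma = [sigma[i] + sigma[i ^ (1 << k)] for i in range(p)]
         for i in range(p):
             if i & (1 << k):                    # higher-index half of the new sub cube
                 x[i] += sigma[i ^ (1 << k)]     # add the lower half's total
         sigma = new_sigma
     return x

 print(hypercube_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8], d=3))
 # [1, 3, 6, 10, 15, 21, 28, 36]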


Large message sizes: pipelined binary tree

The Pipelined Binary Tree Algorithm is another algorithm for distributed memory platforms which is specifically well suited for large message sizes. Like the hypercube algorithm, it assumes a special communication structure. The processing elements (PEs) are hypothetically arranged in a binary tree (e.g. a Fibonacci Tree) with infix numeration according to their index within the PEs. Communication on such a tree always occurs between parent and child nodes.

The infix numeration ensures that for any given PE_j, the indices of all nodes reachable by its left subtree are less than j and the indices of all nodes in the right subtree are greater than j. The parent's index is greater than any of the indices in PE_j's subtree if PE_j is a left child, and smaller if PE_j is a right child. This allows for the following reasoning:

* The local prefix sum of the left subtree has to be aggregated to calculate PE_j's local prefix sum.
* The local prefix sum of the right subtree has to be aggregated to calculate the local prefix sum of the higher-level PEs PE_h which are reached on a path containing a left child connection (which means h > j).
* The total prefix sum of PE_j is necessary to calculate the total prefix sums in the right subtree (e.g. for the highest index node in the subtree).
* PE_j needs to include the total prefix sum of the first higher-order node which is reached via an upward path including a right child connection in order to calculate its own total prefix sum.

Note the distinction between subtree-local and total prefix sums. Points two, three and four may suggest a circular dependency, but this is not the case. Lower-level PEs might require the total prefix sum of higher-level PEs to calculate their total prefix sum, but higher-level PEs only require subtree-local prefix sums to calculate their total prefix sum. The root node, as the highest-level node, only requires the local prefix sum of its left subtree to calculate its own prefix sum. Each PE on the path from PE_0 to the root PE only requires the local prefix sum of its left subtree to calculate its own prefix sum, whereas every node on the path from PE_{p-1} (the last PE) to the root PE requires the total prefix sum of its parent to calculate its own total prefix sum.

This leads to a two-phase algorithm:

Upward phase: Propagate the subtree-local prefix sum to the parent of each PE_j.

Downward phase: Propagate to the lower-level PEs in the left child subtree of PE_j the exclusive total prefix sum (excluding PE_j as well as the PEs in its left subtree) of all lower-index PEs which are not included in the addressed subtree of PE_j. Propagate the inclusive prefix sum to the right child subtree of PE_j.

Note that the algorithm is run in parallel at each PE and the PEs will block upon receive until their children/parents provide them with packets.

 k := number of packets in a message m of a PE
 m @ {left, right, parent, this} := // Messages at the different PEs

 x = m @ this

 // Upward phase - Calculate subtree-local prefix sums
 for j = 0 to k-1:  // Pipelining: for each packet of a message
     if hasLeftChild:
         blocking receive m @ left   // This replaces the local m with the received m
         // Aggregate inclusive local prefix sum from lower-index PEs
         x = m ⨁ x
     if hasRightChild:
         blocking receive m @ right
         // We do not aggregate m into the local prefix sum, since the right children are higher-index PEs
         send x ⨁ m to parent
     else:
         send x to parent

 // Downward phase
 for j = 0 to k-1:
     m @ this = 0
     if hasParent:
         blocking receive m @ parent
         // For a left child m is the parent's exclusive prefix sum, for a right child the inclusive prefix sum
         x = m ⨁ x
     send m to left   // The total prefix sum of all PEs smaller than this PE or any PE in its left subtree
     send x to right  // The total prefix sum of all PEs smaller than or equal to this PE
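
For intuition, the following Python sketch computes the same totals on an infix-numbered balanced tree; it folds the upward and downward phases into a single recursive traversal and ignores message passing and pipelining entirely (names are illustrative):

 def tree_scan(values):
     # values[i] is the local prefix sum held by PE_i; returns each PE's total prefix sum.
     n = len(values)
     totals = [0] * n

     def scan(lo, hi, offset):
         # Process the subtree holding PEs lo..hi-1; 'offset' is the prefix sum
         # of everything left of this subtree. Returns the subtree's total sum.
         if lo >= hi:
             return 0
         root = (lo + hi) // 2                         # infix numbering: root in the middle
         left_sum = scan(lo, root, offset)             # left subtree sees the same offset
         totals[root] = offset + left_sum + values[root]
         right_sum = scan(root + 1, hi, totals[root])  # right subtree includes the root
         return left_sum + values[root] + right_sum

     scan(0, n, 0)
     return totals

 print(tree_scan([1, 2, 3, 4, 5, 6, 7]))   # [1, 3, 6, 10, 15, 21, 28]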


Pipelining

If the message m of length n can be divided into k packets and the operator ⨁ can be used on each of the corresponding message packets separately, pipelining is possible. If the algorithm is used without pipelining, there are always only two levels (the sending PEs and the receiving PEs) of the binary tree at work while all other PEs are waiting. If there are p processing elements and a balanced binary tree is used, the tree has \log_2 p levels; the length of the path from PE_0 to the root is therefore \log_2 p - 1, which represents the maximum number of non-parallel communication operations during the upward phase. Likewise, the communication on the downward path is also limited to \log_2 p - 1 startups. Assuming a communication startup time of T_\text{Start} and a bytewise transmission time of T_\text{Byte}, the upward and downward phases are limited to (2\log_2 p - 2)(T_\text{Start} + n\cdot T_\text{Byte}) in a non-pipelined scenario. Upon division into k packets, each of size \tfrac{n}{k}, and sending them separately, the first packet still needs (\log_2 p - 1)\left(T_\text{Start} + \frac{n}{k}\cdot T_\text{Byte}\right) to be propagated to the root as part of a local prefix sum, and this will occur again for the last packet if k > \log_2 p. However, in between, all the PEs along the path can work in parallel, and each third communication operation (receive left, receive right, send to parent) sends a packet to the next level, so that one phase can be completed in 2\log_2 p - 1 + 3(k-1) communication operations and both phases together need (4\log_2 p - 2 + 6(k-1))\left(T_\text{Start} + \frac{n}{k}\cdot T_\text{Byte}\right), which is favourable for large message sizes n. The algorithm can further be optimised by making use of full-duplex or telephone model communication and overlapping the upward and the downward phase.


Data structures

When a data set may be updated dynamically, it may be stored in a Fenwick tree data structure. This structure allows both the lookup of any individual prefix sum value and the modification of any array value in logarithmic time per operation. However, an earlier 1982 paper presents a data structure called Partial Sums Tree (see Section 5.1) that appears to overlap Fenwick trees; in 1982 the term prefix-sum was not yet as common as it is today.

For higher-dimensional arrays, the summed area table provides a data structure based on prefix sums for computing sums of arbitrary rectangular subarrays. This can be a helpful primitive in image convolution operations.
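
A minimal Fenwick tree sketch in Python, supporting point updates and prefix-sum queries in logarithmic time (class and method names are illustrative):

 class FenwickTree:
     # Minimal Fenwick / binary indexed tree: point updates and prefix-sum
     # queries, both in O(log n).
     def __init__(self, n):
         self.n = n
         self.tree = [0] * (n + 1)          # 1-based internal array

     def update(self, i, delta):
         """Add delta to element i (0-based)."""
         i += 1
         while i <= self.n:
             self.tree[i] += delta
             i += i & -i                    # jump to the next responsible node

     def prefix_sum(self, i):
         """Return the sum of elements 0..i (inclusive)."""
         i += 1
         s = 0
         while i > 0:
             s += self.tree[i]
             i -= i & -i                    # strip the lowest set bit
         return s

 ft = FenwickTree(8)
 for idx, v in enumerate([1, 2, 3, 4, 5, 6, 7, 8]):
     ft.update(idx, v)
 print([ft.prefix_sum(i) for i in range(8)])   # [1, 3, 6, 10, 15, 21, 28, 36]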


Applications

Counting sort is an integer sorting algorithm that uses the prefix sum of a histogram of key frequencies to calculate the position of each key in the sorted output array. It runs in linear time for integer keys that are smaller than the number of items, and is frequently used as part of radix sort, a fast algorithm for sorting integers that are less restricted in magnitude.
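
A short Python sketch of counting sort, with the exclusive prefix sum of the histogram giving each key's starting position in the output (assuming non-negative integer keys below a bound k):

 def counting_sort(keys, k):
     # Sort integer keys in the range [0, k) using a prefix sum of the histogram.
     counts = [0] * k
     for key in keys:
         counts[key] += 1
     # Exclusive prefix sum: counts[v] becomes the index of the first v in the output.
     total = 0
     for v in range(k):
         counts[v], total = total, total + counts[v]
     out = [None] * len(keys)
     for key in keys:
         out[counts[key]] = key
         counts[key] += 1
     return out

 print(counting_sort([3, 1, 4, 1, 5, 2, 6, 5, 3, 5], k=7))
 # [1, 1, 2, 3, 3, 4, 5, 5, 5, 6]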

List ranking, the problem of transforming a linked list into an array that represents the same sequence of items, can be viewed as computing a prefix sum on the sequence 1, 1, 1, ... and then mapping each item to the array position given by its prefix sum value; by combining list ranking, prefix sums, and Euler tours, many important problems on trees may be solved by efficient parallel algorithms.

An early application of parallel prefix sum algorithms was in the design of binary adders, Boolean circuits that can add two n-bit binary numbers. In this application, the sequence of carry bits of the addition can be represented as a scan operation on the sequence of pairs of input bits, using the majority function to combine the previous carry with these two bits. Each bit of the output number can then be found as the exclusive or of two input bits with the corresponding carry bit. By using a circuit that performs the operations of the parallel prefix sum algorithm, it is possible to design an adder that uses O(n) logic gates and O(\log n) time steps (English translation: "On the algorithmic complexity of discrete functions", ''Soviet Physics Doklady'' 7: 589–591, 1963; English translation in ''Syst. Theory Res.'' 19: 105–122, 1970).
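
The carry recurrence described above can be illustrated sequentially in Python; a genuinely parallel adder would instead apply a parallel prefix circuit to an associative reformulation of this recurrence, so the sketch below only shows the scan relation itself (function names and the little-endian bit lists are illustrative):

 def add_binary(a_bits, b_bits):
     # a_bits, b_bits: little-endian lists of 0/1 of equal length.
     # Carries form a scan with the majority function; each output bit is the
     # exclusive or of the two input bits and the incoming carry.
     def maj(x, y, z):
         return (x & y) | (x & z) | (y & z)
     carry = 0
     out = []
     for a, b in zip(a_bits, b_bits):
         out.append(a ^ b ^ carry)
         carry = maj(a, b, carry)
     out.append(carry)
     return out

 print(add_binary([1, 0, 1, 1], [1, 1, 0, 1]))   # 13 + 11 = 24 -> [0, 0, 0, 1, 1]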

In the parallel random access machine model of computing, prefix sums can be used to simulate parallel algorithms that assume the ability for multiple processors to access the same memory cell at the same time, on parallel machines that forbid simultaneous access. By means of a sorting network, a set of parallel memory access requests can be ordered into a sequence such that accesses to the same cell are contiguous within the sequence; scan operations can then be used to determine which of the accesses succeed in writing to their requested cells, and to distribute the results of memory read operations to multiple processors that request the same result.

In Guy Blelloch's Ph.D. thesis, parallel prefix operations form part of the formalization of the data parallelism model provided by machines such as the Connection Machine. The Connection Machine CM-1 and CM-2 provided a hypercubic network on which Algorithm 1 above could be implemented, whereas the CM-5 provided a dedicated network to implement Algorithm 2.

In the construction of Gray codes, sequences of binary values with the property that consecutive sequence values differ from each other in a single bit position, a number n can be converted into the Gray code value at position n of the sequence simply by taking the exclusive or of n and n/2 (the number formed by shifting n right by a single bit position). The reverse operation, decoding a Gray-coded value x into a binary number, is more complicated, but can be expressed as the prefix sum of the bits of x, where each summation operation within the prefix sum is performed modulo two. A prefix sum of this type may be performed efficiently using the bitwise Boolean operations available on modern computers, by computing the exclusive or of x with each of the numbers formed by shifting x to the left by a number of bits that is a power of two.
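
A short Python illustration, using the common convention in which the Gray code of n is the exclusive or of n and n/2; under this bit ordering the decoding prefix sum uses right shifts by powers of two (the shift direction depends on the chosen bit ordering):

 def gray_encode(n):
     # Gray code of n: adjacent integers differ in exactly one bit.
     return n ^ (n >> 1)

 def gray_decode(g):
     # Prefix sum modulo two over the bits of g, via shift-and-XOR steps
     # whose shift amounts are powers of two.
     shift = 1
     while (g >> shift) > 0:
         g ^= g >> shift
         shift <<= 1
     return g

 print([gray_encode(i) for i in range(8)])                # [0, 1, 3, 2, 6, 7, 5, 4]
 print([gray_decode(gray_encode(i)) for i in range(8)])   # [0, 1, 2, 3, 4, 5, 6, 7]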

Parallel prefix (using multiplication as the underlying associative operation) can also be used to build fast algorithms for parallel polynomial interpolation. In particular, it can be used to compute the divided difference coefficients of the Newton form of the interpolation polynomial. This prefix-based approach can also be used to obtain the generalized divided differences for (confluent) Hermite interpolation as well as for parallel algorithms for Vandermonde systems.


See also

* General-purpose computing on graphics processing units
* Summed-area table


References


External links

*{{mathworld|title=Cumulative Sum|urlname=CumulativeSum}}