Parallel External Memory (Model)
   HOME

TheInfoList



OR:

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory
abstract machine An abstract machine is a computer science theoretical model that allows for a detailed and precise analysis of how a computer system functions. It is analogous to a mathematical function in that it receives inputs and produces outputs based on pre ...
. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the
parallel random-access machine In computer science, a parallel random-access machine (parallel RAM or PRAM) is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM) (not to be confused ...
(PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory. __TOC__


Model


Definition

The PEM model is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of P processors and a two-level
memory hierarchy In computer architecture, the memory hierarchy separates computer storage into a hierarchy based on response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by their performance and controlli ...
. This memory hierarchy consists of a large external memory (main memory) of size N and P small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size M which is partitioned in blocks of size B. The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size B.


I/O complexity

The complexity measure of the PEM model is the I/O complexity, which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if P processors load parallelly a data block of size B form the main memory into their caches, it is considered as an I/O complexity of O(1) not O(P). A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.


Read/write conflicts

In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts occur. Like in the PRAM model, three different variations of this problem are considered: * Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently. * Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time. * Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time. The following two algorithms solve the CREW and EREW problem if P \leq B processors write to the same block simultaneously. A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of P parallel block transfers. A second approach needs O(\log(P)) parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round P processors combine their blocks into P/2 blocks. Then P/2 processors combine the P/2 blocks into P/4. This procedure is continued until all the data is combined in one block.


Comparison to other models


Examples


Multiway partitioning

Let M=\ be a vector of d-1 pivots sorted in increasing order. Let A be an unordered set of N elements. A d-way partition of A is a set \Pi=\ , where \cup_^d A_i = A and A_i\cap A_j=\emptyset for 1\leq i. A_i is called the i-th bucket. The number of elements in A_i is greater than m_ and smaller than m_^2. In the following algorithm the input is partitioned into N/P-sized contiguous segments S_1,...,S_P in main memory. The processor i primarily works on the segment S_i. The multiway partitioning algorithm (PEM_DIST_SORT) uses a PEM
prefix sum In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers is a second sequence of numbers , the sums of prefixes ( running totals) of the input sequence: : : : :... For instance, the prefix sums ...
algorithm to calculate the prefix sum with the optimal O\left(\frac + \log(P)\right) I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm. // Compute parallelly a d-way partition on the data segments S_i for each processor i in parallel do Read the vector of pivots M into the cache. Partition S_i into d buckets and let vector M_i=\ be the number of items in each bucket. end for Run PEM prefix sum on the set of vectors \ simultaneously. // Use the prefix sum vector to compute the final partition for each processor i in parallel do Write elements S_i into memory locations offset appropriately by M_ and M_. end for Using the prefix sums stored in M_P the last processor P calculates the vector B of bucket sizes and returns it. If the vector of d=O\left(\frac\right) pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with O\left(\frac + \left\lceil \frac \right\rceil>\log(P)+d\log(B)\right) I/O complexity. The content of the final buckets have to be located in contiguous memory.


Selection

The
selection problem In computer science, a selection algorithm is an algorithm for finding the ''k''th smallest number in a list or array; such a number is called the ''k''th ''order statistic''. This includes the cases of finding the minimum, maximum, and median e ...
is about finding the k-th smallest item in an unordered list A of size N. The following code makes use of PRAMSORT which is a PRAM optimal sorting algorithm which runs in O(\log N), and SELECT, which is a cache optimal single-processor selection algorithm. if N \leq P then \texttt(A,P) return A /math> end if //Find median of each S_i for each processor i in parallel do m_i = \texttt(S_i, \frac) end for // Sort medians \texttt(\lbrace m_1, \dots, m_2 \rbrace, P) // Partition around median of medians t = \texttt(A, m_,P) if k \leq t then return \texttt(A :t P, k) else return \texttt(A +1:N P, k-t) end if Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of: O(\frac + \log (PB) \cdot \log(\frac))


Distribution sort

Distribution sort In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is import ...
partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list. If P = 1 the task is delegated to a cache-optimal single-processor sorting algorithm. Otherwise the following algorithm is used: // Sample \tfrac elements from A for each processor i in parallel do if M < , S_i, then d = M/B Load S_i in M-sized pages and sort pages individually else d = , S_i, Load and sort S_i as single page end if Pick every \sqrt/4'th element from each sorted memory page into contiguous vector R^i of samples end for in parallel do Combine vectors R^1 \dots R^P into a single contiguous vector \mathcal Make \sqrt copies of \mathcal: \mathcal_1 \dots \mathcal_ end do // Find \sqrt pivots \mathcal /math> for j = 1 to \sqrt in parallel do \mathcal = \texttt(\mathcal_i, \tfrac, \tfrac) end for Pack pivots in contiguous array \mathcal // Partition Aaround pivots into buckets \mathcal \mathcal = \texttt(A :N\mathcal,\sqrt,P) // Recursively sort buckets for j = 1 to \sqrt + 1 in parallel do recursively call \texttt on bucket jof size \mathcal /math> using O \left( \left \lceil \tfrac \right \rceil \right) processors responsible for elements in bucket j end for The I/O complexity of PEMDISTSORT is: O \left( \left \lceil \frac \right \rceil \left ( \log_d P + \log_ \frac \right ) + f(N,P,d) \cdot \log_d P \right) where f(N,P,d) = O \left ( \log \frac \log \frac + \left \lceil \frac \log P + \sqrt \log B \right \rceil \right ) If the number of processors is chosen that f(N,P,d) = O\left ( \left \lceil \tfrac \right \rceil \right )and M < B^ the I/O complexity is then: O \left ( \frac \log_ \frac \right )


Other PEM algorithms

Where \textrm_P(N) is the time it takes to sort N items with P processors in the PEM model.


See also

*
Parallel random-access machine In computer science, a parallel random-access machine (parallel RAM or PRAM) is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM) (not to be confused ...
(PRAM) *
Random-access machine In computer science, random-access machine (RAM) is an abstract machine in the general class of register machines. The RAM is very similar to the counter machine but with the added capability of 'indirect addressing' of its registers. Like the cou ...
(RAM) * External memory (EM)


References

{{Parallel Computing Algorithms Models of computation Analysis of parallel algorithms External memory algorithms Cache (computing)