DOPIPE

DOPIPE parallelism is a method of exploiting loop-level parallelism by pipelining the statements in a loop. Pipelined parallelism may exist at different levels of abstraction, such as loops, functions and algorithmic stages. The extent of parallelism depends on the programmer's ability to identify the independent tasks, separate them, and execute them in parallel.


Background

The main purpose of employing loop-level parallelism is to find the sequential tasks of a program and split them into parallel tasks without any prior information about the algorithm. Parts of a computation that recur and consume a significant amount of execution time are good candidates for loop-level parallelism. Common applications are found in mathematical analysis, where multi-dimensional matrices are iterated over in nested loops.

Different parallelization techniques are used depending on data-storage overhead, the degree of parallelization and the data dependencies involved. The best-known techniques are DOALL, DOACROSS and DOPIPE; a DOALL loop is contrasted with a dependence-carrying loop in the sketch below.

* DOALL: used where each iteration of the loop can be executed in parallel without any interaction between iterations. The overall run time is reduced from N * T (on a serial processor, where T is the execution time of each iteration) to just T, since all N iterations execute in parallel.
* DOACROSS: used where there is a possibility of data dependences between iterations. The data-independent tasks are executed in parallel, while the dependent ones execute sequentially, and a degree of synchronization is used to order the dependent tasks across the parallel processors.
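
As a rough illustration of the distinction, the following C sketch (not taken from the original article; the arrays a, b, c, the size N and the OpenMP pragma are assumptions made for the example) contrasts a DOALL loop, in which each iteration touches only its own data, with a loop whose loop-carried dependence prevents straightforward DOALL parallelization:

#include <stdio.h>

#define N 8

int main(void)
{
    double a[N + 1], b[N + 1], c[N + 1];
    for (int i = 0; i <= N; i++) { a[i] = i; b[i] = 1.0; c[i] = 0.0; }

    /* DOALL: c[i] depends only on a[i] and b[i] from the same iteration,
       so all N iterations may execute in parallel. */
    #pragma omp parallel for
    for (int i = 1; i <= N; i++)
        c[i] = a[i] + b[i];

    /* Not DOALL: c[i] depends on c[i-1] written by the previous iteration,
       so the iterations cannot simply run concurrently; DOACROSS- or
       DOPIPE-style synchronization is required instead. */
    for (int i = 1; i <= N; i++)
        c[i] = c[i - 1] + b[i];

    printf("c[%d] = %f\n", N, c[N]);
    return 0;
}

The pragma is only a marker here; the code compiles and runs correctly with or without OpenMP support.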


Description

DOPIPE is a pipelined parallelization technique used in loops where a value produced in one iteration is consumed in a later iteration. The following example shows how to apply the DOPIPE technique to reduce the total execution time by breaking the tasks inside the loop and executing them in a pipelined manner. The loop body is split into tasks in such a way that all dependencies between the tasks are unidirectional: a later pipeline stage consumes values produced by an earlier stage, but never the other way around.


Example

The program below shows pseudocode for DOPIPE parallelization. In this code, there are three tasks (F0, F1 and F2) inside a loop iterating over j from 1 to N. The dependencies in the code are as follows:

F1[j] →T F1[j+1]: statement F1 in iteration j+1 must be executed after statement F1 in iteration j. This is a loop-carried true dependence.

F1[j] →T F2[j]: statement F2 in iteration j must be executed after statement F1 in iteration j.
for (j=1; j<=N; j++) {
    F0: o[j] = x[j] - a[j];       // no dependence on any other statement or iteration
    F1: z[j] = z[j-1] * 5;        // true dependence on F1 of the previous iteration
    F2: y[j] = z[j] * w[j];       // true dependence on F1 of the same iteration
}
If this code were executed sequentially, the total time consumed would be N * (TF0 + TF1 + TF2), where TF0, TF1 and TF2 denote the per-iteration execution times of F0, F1 and F2 respectively. Now we parallelize the loop by pipelining its statements in the following manner:
for (j=1; j<=N; j++) {
    F0: o[j] = x[j] - a[j];       // independent: all iterations can run in parallel (DOALL)
}

for (j=1; j<=N; j++) {
    F1: z[j] = z[j-1] * 5;
    post(j);                      // signal that z[j] has been produced
}

for (j=1; j<=N; j++) {
    wait(j);                      // wait until F1 has completed iteration j and produced z[j]
    F2: y[j] = z[j] * w[j];
}
F0 is an independent function: it has no loop-carried dependence (no dependence on iteration j+1 or j-1) and no dependence on any other statement in the loop. Hence we can completely separate out this function and run it in parallel using DOALL parallelism. Statements F1 and F2, on the other hand, are dependent (as explained above), so we split them into two different loops and execute them in a pipelined fashion, using post(j) and wait(j) to synchronize between the F1 and F2 loops.

Starting from the first iteration of j, statement F1 executes in TF1 time. Meanwhile, F2 is not executing, since it is waiting for the value z[j] produced by F1. When F1 completes its execution for iteration j, it signals this using post(j). After waiting on wait(j), F2 starts its execution, since the value z[j] is now available. Because F1's execution is not restricted by F2, F1 simultaneously proceeds to iteration j+1. Tracing the execution timeline of all the statements, the total time to execute F0 is just TF0, since all iterations of F0 run in parallel, while the total execution time for F1 and F2 is N * TF1 + TF2 (assuming negligible synchronization time). This is considerably less than the sequential execution time N * (TF0 + TF1 + TF2).
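
The post(j)/wait(j) primitives above are abstract. As a minimal sketch of one possible realization (not part of the original pseudocode; the statement bodies, array names and the counting semaphore are assumptions), the F1 and F2 stages can be run as POSIX threads, with each sem_post by F1 announcing that one more z[j] is available and each sem_wait by F2 blocking until its z[j] has been produced:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8

static double o[N + 1], x[N + 1], a[N + 1];
static double z[N + 1], y[N + 1], w[N + 1];
static sem_t z_ready;                 /* counts how many z[j] have been produced */

static void *run_f1(void *arg)        /* pipeline stage F1 (producer) */
{
    (void)arg;
    for (int j = 1; j <= N; j++) {
        z[j] = z[j - 1] * 5;          /* loop-carried true dependence on z[j-1] */
        sem_post(&z_ready);           /* post(j): z[j] is now available to F2 */
    }
    return NULL;
}

static void *run_f2(void *arg)        /* pipeline stage F2 (consumer) */
{
    (void)arg;
    for (int j = 1; j <= N; j++) {
        sem_wait(&z_ready);           /* wait(j): block until F1 has produced z[j] */
        y[j] = z[j] * w[j];
    }
    return NULL;
}

int main(void)
{
    for (int j = 0; j <= N; j++) {    /* arbitrary illustrative input data */
        x[j] = j; a[j] = 1.0; w[j] = 2.0; z[j] = 1.0;
    }

    sem_init(&z_ready, 0, 0);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, run_f1, NULL);
    pthread_create(&t2, NULL, run_f2, NULL);

    for (int j = 1; j <= N; j++)      /* F0 has no dependences, so it runs */
        o[j] = x[j] - a[j];           /* alongside the F1/F2 pipeline      */

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&z_ready);

    printf("y[%d] = %f\n", N, y[N]);
    return 0;
}

Built with, for example, gcc -pthread, the F1 and F2 loops overlap exactly as in the timeline argument above, while the F0 loop proceeds independently of both.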


Comparison with other models

DOALL parallelism works mainly on the principle of divide and conquer: every task runs in a different iteration on its own unique set of data. A drawback of this approach is that, when a large amount of data is computed in parallel, a correspondingly large cache space is needed to serve the different threads; on the other hand, since there are no dependencies between the threads, there is no overhead for inter-thread communication. In DOPIPE, by contrast, there is a synchronization overhead between the threads, but thanks to its pipelined structure it requires less cache space, because each value is consumed by the consumer immediately after it is produced.


See also

* Parallel computing
* Loop-level parallelism
* Task parallelism
* Data dependency
* OpenMP
* Automatic parallelization
* Thread (computing)
* Cache (computing)

