Automatic parallelization
Automatic parallelization, also auto parallelization or autoparallelization, refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis, and because the best approach may depend upon parameter values that are not known at compilation time.

The programming control structures on which autoparallelization places the most focus are loops, because, in general, most of the execution time of a program takes place inside some form of loop. There are two main approaches to parallelization of loops: pipelined multi-threading and cyclic multi-threading. For example, consider a loop that on each iteration applies a hundred operations and runs for a thousand iterations. This can be thought of as a grid of 100 columns by 1000 rows, a total of 100,000 operations. Cyclic multi-threading assigns each row to a different thread; pipelined multi-threading assigns each column to a different thread.
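To sketch the cyclic approach concretely, the following is a hedged illustration assuming an OpenMP-capable Fortran compiler; the array name and loop body are invented stand-ins. The schedule(static,1) clause deals the rows out to the threads round-robin, one whole iteration per turn:

    ! Hypothetical sketch: cyclic multi-threading of the 1000-row grid.
    ! Each thread receives whole iterations (rows), dealt out round-robin.
    subroutine cyclic_rows(a)
      real, intent(inout) :: a(1000)
      integer :: i
      !$omp parallel do schedule(static,1)
      do i = 1, 1000
        ! stand-in for the row's hundred operations
        a(i) = a(i)*2.0 + 1.0
      end do
      !$omp end parallel do
    end subroutine cyclic_rows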


Automatic parallelization technique


Parse

This is the first stage, where the scanner reads the input source files to identify all static and extern usages. Each line in the file is checked against pre-defined patterns and split into tokens. These tokens are stored in a file which will be used later by the grammar engine. The grammar engine checks patterns of tokens that match pre-defined rules to identify variables, loops, control statements, functions etc. in the code.
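A minimal sketch of the pattern-matching step described above, assuming a toy language whose keyword set and token classes are invented for illustration: split one source line on blanks and classify each token by simple rules.

    ! Hypothetical sketch of the parse stage: split a line into tokens
    ! and classify each one. Keyword set and classes are invented.
    program tokenize_line
      implicit none
      character(len=*), parameter :: line = 'do i = 1 , n'
      character(len=32) :: tok
      integer :: pos, sp
      pos = 1
      do while (pos <= len(line))
        if (line(pos:pos) == ' ') then        ! skip blanks between tokens
          pos = pos + 1
          cycle
        end if
        sp = index(line(pos:), ' ')           ! find the end of this token
        if (sp == 0) then
          tok = line(pos:)
          pos = len(line) + 1
        else
          tok = line(pos:pos+sp-2)
          pos = pos + sp
        end if
        print '(a, " -> ", a)', trim(tok), classify(trim(tok))
      end do
    contains
      function classify(t) result(kind)
        character(len=*), intent(in) :: t
        character(len=10) :: kind
        if (t == 'do' .or. t == 'enddo' .or. t == 'if') then
          kind = 'keyword'
        else if (verify(t, '0123456789') == 0) then               ! digits only
          kind = 'number'
        else if (verify(t, 'abcdefghijklmnopqrstuvwxyz_') == 0) then
          kind = 'identifier'
        else
          kind = 'symbol'
        end if
      end function classify
    end program tokenize_line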


Analyze

The analyzer is used to identify sections of code that can be executed concurrently. It uses the static data information provided by the scanner-parser. The analyzer first finds all the totally independent functions and marks them as individual tasks; it then determines which tasks have dependencies on each other.
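As a minimal sketch of what this analysis enables, assuming OpenMP: the subroutine names below are invented and stand for two functions the analyzer found to be totally independent, so each becomes an individual task and the two may run concurrently.

    subroutine run_independent_tasks(a, t, n)
      integer, intent(in) :: n
      real, intent(out) :: a(n), t(n)
      ! load_input and init_tables share no data, so the analyzer
      ! marks each as an individual task.
      !$omp parallel sections
      !$omp section
      call load_input(a, n)
      !$omp section
      call init_tables(t, n)
      !$omp end parallel sections
    end subroutine run_independent_tasks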


Schedule

The scheduler lists all the tasks and their dependencies on each other in terms of execution and start times. The scheduler produces the optimal schedule either in terms of the number of processors to be used or in terms of the total execution time of the application.
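As a toy illustration (task names and times invented): given independent tasks T1 (2 ms) and T2 (3 ms), and a task T3 (1 ms) that depends on both, a two-processor schedule starts T1 on core 0 and T2 on core 1 at time 0 and places T3 on either core at time 3 ms, for a total execution time of 4 ms; a one-processor schedule needs 6 ms.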


Code Generation

The scheduler generates a list of all the tasks and the details of the cores on which they will execute, along with the time for which they will execute. The code generator then inserts special constructs in the code that will be read during execution by the scheduler. These constructs instruct the scheduler on which core a particular task will execute, along with its start and end times.
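A hedged sketch of what such inserted constructs could look like if OpenMP task directives stood in for the tool-specific annotations the text describes (subroutine and variable names are invented): the depend clauses carry the schedule's ordering information to the runtime.

    subroutine generated_code(a, n)
      integer, intent(in) :: n
      real, intent(inout) :: a(n)
      !$omp parallel
      !$omp single
      !$omp task depend(out: a)
      call produce(a, n)      ! hypothetical task scheduled first
      !$omp end task
      !$omp task depend(in: a)
      call consume(a, n)      ! hypothetical task scheduled after produce
      !$omp end task
      !$omp end single
      !$omp end parallel
    end subroutine generated_code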


Cyclic multi-threading

A cyclic multi-threading parallelizing compiler tries to split up a loop so that each iteration can be executed on a separate processor concurrently.


Compiler parallelization analysis

The ''compiler'' usually conducts two passes of analysis before actual parallelization in order to determine the following:

* Is it safe to parallelize the loop? Answering this question needs accurate dependence analysis and alias analysis.
* Is it worthwhile to parallelize it? This answer requires a reliable estimation (modeling) of the program workload and the capacity of the parallel system.

The first pass of the compiler performs a data dependence analysis of the loop to determine whether each iteration of the loop can be executed independently of the others. Data dependence can sometimes be dealt with, but it may incur additional overhead in the form of message passing, synchronization of shared memory, or some other method of processor communication.

The second pass attempts to justify the parallelization effort by comparing the theoretical execution time of the code after parallelization to the code's sequential execution time. Somewhat counterintuitively, code does not always benefit from parallel execution. The extra overhead that can be associated with using multiple processors can eat into the potential speedup of parallelized code.


Example

A loop is called DOALL if all of its iterations, in any given invocation, can be executed concurrently. The Fortran code below is DOALL, and can be auto-parallelized by a compiler, because each iteration is independent of the others and the final result of array z will be correct regardless of the execution order of the iterations.

    do i = 1, n
      z(i) = x(i) + y(i)
    enddo

There are many pleasingly parallel problems that have such DOALL loops. For example, when rendering a ray-traced movie, each frame of the movie can be independently rendered, and each pixel of a single frame may be independently rendered. On the other hand, the following code cannot be auto-parallelized as written, because the value of z(i) depends on the result of the previous iteration, z(i - 1).

    do i = 2, n
      z(i) = z(i - 1)*2
    enddo

This does not mean that the code cannot be parallelized. Indeed, it is equivalent to the DOALL loop

    do i = 2, n
      z(i) = z(1)*2**(i - 1)
    enddo

However, current parallelizing compilers are not usually capable of bringing out these parallelisms automatically, and it is questionable whether this code would benefit from parallelization in the first place.
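For reference, a sketch of what an auto-parallelizer (or a programmer) might effectively turn the DOALL loop into, assuming OpenMP: the directive asserts that the iterations may run in any order on any thread.

    !$omp parallel do
    do i = 1, n
      z(i) = x(i) + y(i)
    enddo
    !$omp end parallel do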


Pipelined multi-threading

A pipelined multi-threading parallelizing compiler tries to break up the sequence of operations inside a loop into a series of code blocks, such that each code block can be executed on separate processors concurrently. There are many pleasingly parallel problems that have such relatively independent code blocks, in particular systems using pipes and filters.

For example, when producing live broadcast television, the following tasks must be performed many times a second:

# Read a frame of raw pixel data from the image sensor,
# Do MPEG motion compensation on the raw data,
# Entropy compress the motion vectors and other data,
# Break up the compressed data into packets,
# Add the appropriate error correction and do a FFT to convert the data packets into COFDM signals, and
# Send the COFDM signals out the TV antenna.

A pipelined multi-threading parallelizing compiler could assign each of these six operations to a different processor, perhaps arranged in a systolic array, inserting the appropriate code to forward the output of one processor to the next processor.

Recent research focuses on using the power of GPUs and multicore systems to compute such independent code blocks (or simply independent iterations of a loop) at runtime. The memory accessed (whether directly or indirectly) can simply be marked for different iterations of a loop and can be compared for dependency detection. Using this information, the iterations are grouped into levels such that iterations belonging to the same level are independent of each other and can be executed in parallel.
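A hedged sketch of the pipelined idea using OpenMP tasks (the stage subroutines and buffer sizes are invented, and only the first two of the six stages are shown): the depend clauses let stage 2 of frame f overlap with stage 1 of frame f+1, as in a two-stage pipeline.

    subroutine two_stage_pipeline(nframes)
      integer, intent(in) :: nframes
      real :: raw(1024, 64), enc(1024, 64)   ! hypothetical buffers; assumes nframes <= 64
      integer :: f
      !$omp parallel
      !$omp single
      do f = 1, nframes
        !$omp task depend(out: raw(1, f)) firstprivate(f)
        call read_frame(raw(:, f))           ! stage 1: read raw pixel data
        !$omp end task
        !$omp task depend(in: raw(1, f)) depend(out: enc(1, f)) firstprivate(f)
        call compress(raw(:, f), enc(:, f))  ! stage 2: compress the frame
        !$omp end task
      end do
      !$omp end single
      !$omp end parallel
    end subroutine two_stage_pipeline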


Difficulties

Automatic parallelization by compilers or tools is very difficult due to the following reasons:

* dependence analysis is hard for code that uses indirect addressing, pointers, recursion, or indirect function calls, because it is difficult to detect such dependencies at compile time (see the sketch after this list);
* loops often have an unknown number of iterations;
* accesses to global resources are difficult to coordinate in terms of memory allocation, I/O, and shared variables;
* ''irregular algorithms'' that use input-dependent indirection interfere with compile-time analysis and optimization.
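For instance, a hedged illustration of the indirect-addressing problem (the array names are invented): whether the loop below is parallelizable depends entirely on the runtime contents of idx, which the compiler cannot see.

    do i = 1, n
      z(idx(i)) = z(idx(i)) + x(i)
    enddo

If idx contains duplicate values, two iterations write to the same element of z and the loop carries a dependence; if idx is a permutation, the loop is DOALL. A compiler must conservatively assume the former.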


Workaround

Due to the inherent difficulties in full automatic parallelization, several easier approaches exist to get a parallel program of higher quality.

One of these is to allow programmers to add "hints" to their programs to guide compiler parallelization, such as HPF for distributed memory systems and OpenMP or OpenHMPP for shared memory systems.

Another approach is to build an interactive system between programmers and parallelizing tools/compilers. Notable examples are Vector Fabrics' Pareon, SUIF Explorer (the Stanford University Intermediate Format compiler), the Polaris compiler, and ParaWise (formerly CAPTools).

Finally, another approach is hardware-supported speculative multithreading.
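As a sketch of the hint style (hedged: the directive spellings follow the HPF and OpenMP specifications as remembered, and the arrays are invented), the same DOALL loop can be annotated for a distributed-memory HPF compiler or a shared-memory OpenMP compiler:

    !HPF$ DISTRIBUTE z(BLOCK)   ! HPF: spread z across processors' memories
    !HPF$ INDEPENDENT           ! HPF: assert the iterations are independent
    do i = 1, n
      z(i) = x(i) + y(i)
    enddo

    !$omp parallel do           ! OpenMP: run iterations on shared-memory threads
    do i = 1, n
      z(i) = x(i) + y(i)
    enddo
    !$omp end parallel do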


Parallelizing compilers and tools

Most research compilers for automatic parallelization consider Fortran programs, because Fortran makes stronger guarantees about aliasing than languages such as C. Typical examples are:

* Paradigm compiler
* Rice Fortran D compiler
* SUIF compiler
* Vienna Fortran compiler

Recently, Aubert, Rubiano, Rusch, and Seiller used a dependency analysis technique to automatically parallelize loops in C code.
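To illustrate the aliasing guarantee mentioned above (a sketch; the subroutine is invented): in Fortran, the dummy arguments x and z below may be assumed not to overlap, so the loop is safely DOALL, whereas an equivalent C function taking pointer parameters would need a restrict qualifier or interprocedural analysis before the same conclusion could be drawn.

    subroutine add_scaled(z, x, n)
      integer, intent(in) :: n
      real, intent(in) :: x(n)
      real, intent(inout) :: z(n)
      integer :: i
      ! Fortran forbids calling this with overlapping actual arguments
      ! when one is written to, so the iterations are independent.
      do i = 1, n
        z(i) = z(i) + 2.0*x(i)
      end do
    end subroutine add_scaled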


See also

* Loop nest optimization
* Parallelization contract
* Polytope model, also known as the polyhedral model
* Scalable parallelism
* BMDFM
* Vectorization
* SequenceL



Further reading

* (NB. Uses the term ''Occam transpiler'' as a synonym for a source-to-source compiler working as a pre-processor that takes a normal occam program as input and derives a new occam source code as output with link-to-channel assignments etc. added to it, thereby ''configuring'' it for parallel processing to run as efficiently as possible on a network of transputers.)