Binary Modular Dataflow Machine (BMDFM) is a software package that runs an application in parallel on shared memory symmetric multiprocessing (SMP) computers, using the multiple processors to speed up the execution of a single application. BMDFM automatically identifies and exploits parallelism through static and, mainly, ''dynamic scheduling'' of the dataflow instruction sequences derived from the formerly sequential program. The BMDFM dynamic scheduling subsystem performs an SMP emulation of a ''tagged-token dataflow machine'' to provide transparent dataflow semantics for applications. No directives for parallel execution are needed.

Background

Current parallel shared memory SMPs are complex machines, where a large number of architectural aspects must be addressed simultaneously to achieve high performance. Recent commodity SMP machines for technical computing can have many tightly coupled cores; good examples are SMP machines based on multi-core processors from Intel (Core, Xeon) or IBM (Power). According to computer makers' announcements, the number of cores per SMP node is planned to double every few years.

Multi-core processors are intended to exploit thread-level parallelism identified by software. Hence, the most challenging task is to find an efficient way to harness the power of multi-core processors for processing an application program in parallel. The existing OpenMP paradigm of static parallelization with a fork-join runtime library works well only for loop-intensive, regular, array-based computations. Compile-time parallelization methods are weak in general and almost inapplicable to irregular applications:

* Many operations take a non-deterministic amount of time, making it difficult to know exactly when certain pieces of data will become available.
* A memory hierarchy with multi-level caches has unpredictable memory access latencies.
* In multi-user mode, other people's codes can use up resources or slow down a part of the computation in a way that the compiler cannot account for.
* Compile-time inter-procedural and cross-conditional optimizations are hard (very often impossible) because compilers cannot figure out which way a conditional will go or cannot optimize across a function call.
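The effect of non-deterministic operation times on a compile-time work split can be illustrated with a toy model. The sketch below is not BMDFM code: the function names and the task durations are assumptions chosen for illustration. It compares a fixed contiguous split of tasks (decided before the durations are known) with handing each task to whichever worker becomes idle first.

```python
import heapq


def static_makespan(durations, workers):
    """Split tasks into contiguous equal-sized blocks, one block per worker."""
    block = -(-len(durations) // workers)  # ceiling division
    loads = [sum(durations[i:i + block])
             for i in range(0, len(durations), block)]
    return max(loads)


def dynamic_makespan(durations, workers):
    """Hand each task, in order, to whichever worker becomes idle first."""
    free_at = [0] * workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest idle worker
        heapq.heappush(free_at, start + d)  # busy until start + d
    return max(free_at)


# One long task followed by seven short ones (hypothetical durations).
durations = [8, 1, 1, 1, 1, 1, 1, 1]
print(static_makespan(durations, 2))   # 11
print(dynamic_makespan(durations, 2))  # 8
```

In this toy case the fixed split leaves one worker idle while the other finishes the long task, whereas the runtime assignment keeps both busy.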

Transparent dataflow semantics of BMDFM

The BMDFM technology mainly uses dynamic scheduling to exploit the parallelism of an application program and thus avoids the disadvantages of compile-time methods mentioned above. BMDFM is a parallel programming environment for multi-core SMP that provides:

* A conventional programming paradigm requiring no directives for parallel execution.
* Transparent (implicit) exploitation of parallelism in a natural and load-balanced manner, automatically using all available multi-core processors in the system.

BMDFM combines the advantages of known architectural principles into a single hybrid architecture that is able to exploit the implicit parallelism of applications with negligible dynamic scheduling overhead and no bottlenecks. Mainly, the basic dataflow principle is used. The dataflow principle says: "An instruction or a function can be executed as soon as all its arguments are ready. A dataflow machine manages the tags for every piece of data at runtime. Data is marked with a ready tag when it has been computed. Instructions with ready arguments get executed, marking their result data ready".

The main feature of BMDFM is to provide a conventional programming paradigm at the top level, the so-called transparent dataflow semantics. A user understands BMDFM as a virtual machine (VM) that runs all statements of an application program in parallel, with all parallelizing and synchronizing mechanisms fully transparent. The statements of an application program are normal operators, of which any single-threaded program might consist: variable assignments, conditional processing, loops, function calls, etc. Suppose we have the code fragment shown below:

    (setq a (foo0 i))          # a = foo0(i);
    (setq b (foo1 (+ i 1)))    # b = foo1(i+1);
    (setq b (++ b))            # b++;
    (outf "a = %d\n" a)        # printf("a = %d\n", a);
    (outf "b = %d\n" b)        # printf("b = %d\n", b);

The first two statements are independent, so the dataflow engine of BMDFM can run them on different processors or processor cores. The last two statements can also run in parallel, but only after "a" and "b" have been computed. The dataflow engine recognizes dependencies automatically because it builds a dataflow graph dynamically at runtime. Additionally, the dataflow engine orders the output stream so that results are output sequentially. Thus, even after out-of-order processing, the results appear in their natural order.

Suppose that the above code fragment is now nested in a loop:

    (for i 1 1 N (progn        # for (i = 1; i <= N; i++)

The dataflow engine of BMDFM keeps the variables "a" and "b" under unique contexts for each iteration. In effect, these are different copies of the variables. A context variable exists as long as it is referenced by instruction consumers. Contexts that are no longer referenced are garbage collected at runtime. Therefore, the dataflow engine can exploit both local parallelism within an iteration and global parallelism across iterations by running multiple iterations simultaneously.
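The firing rule and the per-iteration contexts described above can be sketched in a few lines of Python. This is a single-threaded illustration under stated assumptions, not the BMDFM implementation: `foo0`, `foo1`, `build_iteration` and `run` are hypothetical stand-ins, the reassignment of "b" is modelled by renaming the intermediate value to `b0`, and the presence of a `(context, name)` key in the `tokens` dictionary plays the role of the ready tag.

```python
def foo0(i):  # hypothetical stand-in for the article's foo0
    return i * 10


def foo1(i):  # hypothetical stand-in for the article's foo1
    return i * 100


def build_iteration():
    """Instructions of one loop body: (destination, function, operand names)."""
    return [
        ("a", foo0, ["i"]),
        ("b0", lambda i: foo1(i + 1), ["i"]),
        ("b", lambda b0: b0 + 1, ["b0"]),
        ("out0", lambda a: "a = %d" % a, ["a"]),
        ("out1", lambda b: "b = %d" % b, ["b"]),
    ]


def run(n):
    tokens = {}   # (context, name) -> value; presence means "ready"
    pending = []  # instructions still waiting for operands
    for ctx in range(1, n + 1):       # one context per iteration
        tokens[(ctx, "i")] = ctx
        for dest, fn, args in build_iteration():
            pending.append((ctx, dest, fn, args))
    pending.reverse()  # scan in reverse to show program order is irrelevant
    fired = []
    while pending:
        waiting = []
        for ctx, dest, fn, args in pending:
            if all((ctx, name) in tokens for name in args):
                # All operands carry a ready tag: fire and tag the result.
                tokens[(ctx, dest)] = fn(*(tokens[(ctx, name)] for name in args))
                fired.append((ctx, dest))
            else:
                waiting.append((ctx, dest, fn, args))
        if len(waiting) == len(pending):
            raise RuntimeError("no instruction can fire")
        pending = waiting
    # Emit the buffered output lines in sequential program order.
    output = [tokens[(ctx, name)]
              for ctx in range(1, n + 1) for name in ("out0", "out1")]
    return fired, output


fired, output = run(2)
print("\n".join(output))
```

Although instructions fire in an order driven purely by operand availability (here, iteration 2 fires before iteration 1), the output lines are emitted in loop order because they are buffered per context and written out sequentially at the end.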

Architecture

BMDFM is a convenient parallel programming environment and an efficient runtime engine for multi-core SMP due to the MIMD unification of several architectural paradigms (von Neumann, SMP and dataflow):

* First, it is a hybrid dataflow emulator running multithreaded on commodity SMP. The SMP ensures MIMD while dataflow exploits implicit parallelism.
* Second, it is a hybrid multithreaded dataflow runtime engine controlled by a von Neumann front-end VM. The dataflow runtime engine executes tagged-token contextual parallel instructions (as opposed to the restricted fork-join paradigm), while the von Neumann front-end VM initializes contexts and feeds the dataflow runtime engine with marshaled clusters of instructions.
* Third, it is a hybrid of static and dynamic parallelization. The von Neumann front-end VM tries to split an application statically into parallel marshaled clusters of instructions, while the dataflow runtime engine complements the static parallelization methods dynamically.

BMDFM is intended for use as a parallel runtime engine (instead of a conventional fork-join runtime library) able to run irregular applications automatically in parallel. Due to the transparent dataflow semantics on top, BMDFM is a simple parallelization technique for application programmers and, at the same time, a parallel programming and compiling technology well suited to multi-core SMP computers.

The basic concept of BMDFM relies on underlying commodity SMP hardware, which is available on the market. Normally, SMP vendors provide their own SMP operating system (OS) with an SVR4/POSIX UNIX interface (Linux, HP-UX, SunOS/Solaris, Tru64OSF1, IRIX, AIX, BSD, MacOS, etc.). On top of an SMP OS, the multithreaded dataflow runtime engine performs a software emulation of the dataflow machine. This virtual machine has interfaces to the virtual machine language and to C, providing the transparent dataflow semantics for conventional programming.

BMDFM is built as a hybrid of several architectural principles:

* MIMD (Multiple Instruction Streams, Multiple Data Streams), which is sustained by commodity SMP.
* Implicit parallel execution, which is ensured by dataflow emulation.
* The von Neumann computational principle, which is used to implement the Front-end Control Virtual Machine.

An application program (an input sequential program) is processed in three stages: preliminary code reorganization (code reorganizer), static scheduling of the statements (static scheduler), and compiling/loading (compiler, loader). The output of the static scheduling stages is a multiple clusters flow that feeds the multithreaded engine via an interface designed to avoid bottlenecks. The multiple clusters flow can be thought of as a compiled input program split into marshaled clusters, in which all addresses are resolved and extended with context information. Splitting into marshaled clusters allows them to be loaded by multiple threads. Context information lets iterations be processed in parallel. A listener thread orders the output stream after the out-of-order processing.

The BMDFM dynamic scheduling subsystem is an efficient SMP emulator of the tagged-token dataflow machine. The Shared Memory Pool is divided into three main parts: the ''input/output ring buffer port'' (IORBP), the ''data buffer'' (DB), and the ''operation queue'' (OQ). The ''front-end control virtual machine'' schedules an input application program statically and puts clustered instructions and data of the input program into the IORBP. The ring buffer service processes (IORBP PROC) move data into the DB and instructions into the OQ. The operation queue service processes (OQ PROC) tag instructions as ready for execution if the data of the required operands is accessible. Execution processes (CPU PROC) execute instructions that are tagged as ready and output computed data into the DB or to the IORBP. Additionally, IORBP PROC and OQ PROC are responsible for freeing memory after contexts have been processed.

The context is a special unique identifier representing a copy of data within different iteration bodies, according to the tagged-token dataflow architecture. This allows the dynamic scheduler to handle several iterations in parallel.

Running under an SMP OS, the processes occupy all available real machine processors and processor cores. To allow several processes to access the same data concurrently, the BMDFM dynamic scheduler locks objects in the shared memory pool via SVR4/POSIX semaphore operations. The locking policy provides multiple read-only access and exclusive access for modification.
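The locking policy in the last sentence (many concurrent readers, one exclusive writer) is the classic readers-writer scheme. The sketch below illustrates that policy only. BMDFM itself uses SVR4/POSIX semaphores on a shared memory pool, whereas this example swaps in Python threads and `threading.Condition`; `ReadWriteLock`, `data_buffer`, `writer` and `reader` are hypothetical names introduced for illustration.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of active readers
        self._writing = False  # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()


data_buffer = {"x": 0}  # stand-in for one object in the shared pool
lock = ReadWriteLock()


def writer(times):
    for _ in range(times):
        lock.acquire_write()
        try:
            data_buffer["x"] += 1  # exclusive access for modification
        finally:
            lock.release_write()


def reader(times, seen):
    for _ in range(times):
        lock.acquire_read()
        try:
            seen.append(data_buffer["x"])  # shared read-only access
        finally:
            lock.release_read()


seen = []
threads = [threading.Thread(target=writer, args=(1000,)) for _ in range(4)]
threads.append(threading.Thread(target=reader, args=(100, seen)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(data_buffer["x"])  # 4000
```

This simple variant does not protect writers from starvation under a steady stream of readers; a production scheduler would need a fairness policy on top.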

Supported platforms

Every machine supporting ANSI C and POSIX / UNIX System V (SVR4) may run BMDFM. BMDFM is provided as full multi-threaded versions for:

* x86: Linux/32, FreeBSD/32, OpenBSD/32, NetBSD/32, MacOS/32, SunOS/32, UnixWare/32, Minix/32, Android/32, Win-Cygwin/32, Win-UWIN/32, Win-SFU-SUA/32;
* x86-64: Linux/64, FreeBSD/64, OpenBSD/64, NetBSD/64, MacOS/64, SunOS/64, Android/64, Win-Cygwin/64;
* VAX: Ultrix/32;
* Alpha: Tru64OSF1/64, Linux/64, FreeBSD/64, OpenBSD/64;
* IA-64: HP-UX/32, HP-UX/64, Linux/64, FreeBSD/64;
* XeonPhiMIC: Linux/64;
* MCST-Elbrus: Linux/32, Linux/64;
* PA-RISC: HP-UX/32, HP-UX/64, Linux/32;
* SPARC: SunOS/32, SunOS/64, Linux/32, Linux/64, FreeBSD/64, OpenBSD/64;
* MIPS: IRIX/32, IRIX/64, Linux/32, Linux/64;
* MIPSel: Linux/32, Linux/64, Android/32, Android/64;
* PowerPC: AIX/32, AIX/64, MacOS/32, MacOS/64, Linux/32, Linux/64, FreeBSD/32, FreeBSD/64;
* PowerPCle: Linux/32, Linux/64;
* S/390: Linux/32, Linux/64;
* M68000: Linux/32;
* ARM: Linux/32, Linux/64, FreeBSD/64, Android/32, Android/64, MacOS/64;
* ARMbe: Linux/64;
* RISC-V: Linux/32, Linux/64.

A limited single-threaded version is also provided for Win/32.
