Message passing in computer clusters
   HOME

TheInfoList



OR:

Message passing In computer science, message passing is a technique for invoking behavior (i.e., running a program) on a computer. The invoking program sends a message to a process (which may be an actor or object) and relies on that process and its supporting ...
is an inherent element of all
computer cluster A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newes ...
s. All computer clusters, ranging from homemade
Beowulf ''Beowulf'' (; ) is an Old English poetry, Old English poem, an Epic poetry, epic in the tradition of Germanic heroic legend consisting of 3,182 Alliterative verse, alliterative lines. It is one of the most important and List of translat ...
s to some of the fastest
supercomputers A supercomputer is a type of computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instru ...
in the world, rely on message passing to coordinate the activities of the many nodes they encompass.''Beowulf Cluster Computing With Windows'' by Thomas Lawrence Sterling 2001 MIT Press pages 7–9 Message passing in computer clusters built with commodity
server Server may refer to: Computing *Server (computing), a computer program or a device that provides requested information for other programs or devices, called clients. Role * Waiting staff, those who work at a restaurant or a bar attending custome ...
s and switches is used by virtually every internet service.''Computer Organization and Design'' by David A. Patterson and John L. Hennessy 2011 page 64

/ref> Recently, the use of computer clusters with more than one thousand nodes has been spreading. As the number of nodes in a cluster increases, the rapid growth in the complexity of the communication subsystem makes message passing delays over the Switched fabric, interconnect a serious performance issue in the execution of parallel programs.''Recent Advances in the Message Passing Interface'' by Yiannis Cotronis, Anthony Danalis, Dimitris Nikolopoulos and Jack Dongarra 2011 pages 160–162 Specific tools may be used to simulate, visualize and understand the performance of message passing on computer clusters. Before a large computer cluster is assembled, a trace-based simulator can use a small number of nodes to help predict the performance of message passing on larger configurations. Following test runs on a small number of nodes, the simulator reads the execution and message transfer
log files In computing, logging is the act of keeping a log of events that occur in a computer system, such as problems, errors or broad information on current operations. These events may occur in the operating system or in other software. A message or ' ...
and simulates the performance of the messaging subsystem when many more messages are exchanged between a much larger number of nodes.


Messages and computations


Approaches to message passing

Historically, the two typical approaches to communication between cluster nodes have been PVM, the
Parallel Virtual Machine Parallel Virtual Machine (PVM) is a software tool for parallel networking of computers. It is designed to allow a network of heterogeneous Unix and/or Windows machines to be used as a single distributed parallel processor. Thus large computa ...
and MPI, the
Message Passing Interface The Message Passing Interface (MPI) is a portable message-passing standard designed to function on parallel computing architectures. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of use ...
.''Distributed services with OpenAFS: for enterprise and education'' by Franco Milicchio, Wolfgang Alexander Gehrke 2007, pp. 339-341 However, MPI has now emerged as the de facto standard for message passing on computer clusters.''Recent Advances in Parallel Virtual Machine and Message Passing Interface'' by Matti Ropo, Jan Westerholm and Jack Dongarra 2009 page 231 PVM predates MPI and was developed at the
Oak Ridge National Laboratory Oak Ridge National Laboratory (ORNL) is a federally funded research and development centers, federally funded research and development center in Oak Ridge, Tennessee, United States. Founded in 1943, the laboratory is sponsored by the United Sta ...
around 1989. It provides a set of software libraries that allow a computing node to act as a "parallel virtual machine". It provides run-time environment for message-passing, task and resource management, and fault notification and must be directly installed on every cluster node. PVM can be used by user programs written in C, C++, or Fortran, etc. Unlike PVM, which has a concrete implementation, MPI is a specification rather than a specific set of libraries. The specification emerged in the early 1990 out of discussions between 40 organizations, the initial effort having been supported by ARPA and
National Science Foundation The U.S. National Science Foundation (NSF) is an Independent agencies of the United States government#Examples of independent agencies, independent agency of the Federal government of the United States, United States federal government that su ...
. The design of MPI drew on various features available in commercial systems of the time. The MPI specifications then gave rise to specific implementations. MPI implementations typically use
TCP/IP The Internet protocol suite, commonly known as TCP/IP, is a framework for organizing the communication protocols used in the Internet and similar computer networks according to functional criteria. The foundational protocols in the suite are ...
and socket connections. MPI is now a widely available communications model that enables parallel programs to be written in languages such as C, Fortran,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...
, etc.''Grid and Cluster Computing'' by J. Prabhu 2008 pages 109–112 The MPI specification has been implemented in systems such as
MPICH MPICH, formerly known as MPICH2, is a freely available, portable implementation of MPI, a standard for message-passing for distributed-memory applications used in parallel computing. MPICH is Free and open source software with some public domain c ...
and
Open MPI Open MPI is a Message Passing Interface (MPI) library project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI). It is used by many TOP500 supercomputers including Roadrunner, which was th ...
.


Testing, evaluation and optimization

Computer clusters use a number of strategies for dealing with the distribution of processing over multiple nodes and the resulting communication overhead. Some computer clusters such as
Tianhe-I Tianhe-I, Tianhe-1, or TH-1 (, ; '' Sky River Number One'') is a supercomputer capable of an Rmax (maximum range) of 2.5 peta FLOPS. Located at the National Supercomputing Center of Tianjin, China, it was the fastest computer in the world ...
use different processors for message passing than those used for performing computations. Tiahnhe-I uses over two thousand FeiTeng-1000 processors to enhance the operation of its proprietary message passing system, while computations are performed by
Xeon Xeon (; ) is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded markets. It was introduced in June 1998. Xeon processors are based on the same archite ...
and
Nvidia Tesla Nvidia Tesla is the former name for a line of products developed by Nvidia targeted at stream processing or GPGPU, general-purpose graphics processing units (GPGPU), named after pioneering electrical engineer Nikola Tesla. Its products began us ...
processors.''The TianHe-1A Supercomputer: Its Hardware and Software'' by Xue-Jun Yang, Xiang-Ke Liao, et al in the ''Journal of Computer Science and Technology'', Volume 26, Number 3, May 2011, pages 344–351 ''U.S. says China building 'entirely indigenous' supercomputer'', by Patrick Thibodeau
Computerworld ''Computerworld'' (abbreviated as CW) is a computer magazine published since 1967 aimed at information technology (IT) and Business computing, business technology professionals. Original a print magazine, ''Computerworld'' published its final pr ...
, November 4, 201

/ref> One approach to reducing communication overhead is the use of local neighborhoods (also called Locale (computer hardware), locales) for specific tasks. Here computational tasks are assigned to specific "neighborhoods" in the cluster, to increase efficiency by using processors which are closer to each other. However, given that in many cases the actual
topology Topology (from the Greek language, Greek words , and ) is the branch of mathematics concerned with the properties of a Mathematical object, geometric object that are preserved under Continuous function, continuous Deformation theory, deformat ...
of the computer cluster nodes and their interconnections may not be known to application developers, attempting to fine tune performance at the application program level is quite difficult. Given that MPI has now emerged as the de facto standard on computer clusters, the increase in the number of cluster nodes has resulted in continued research to improve the efficiency and scalability of MPI libraries. These efforts have included research to reduce the
memory footprint Memory footprint refers to the amount of main memory that a program uses or references while running. The word footprint generally refers to the extent of physical dimensions that an object occupies, giving a sense of its size. In computing, t ...
of MPI libraries. From the earliest days MPI provided facilities for performance profiling via the PMPI "profiling system". The use of the PMIPI- prefix allows for the observation of the entry and exit points for messages. However, given the high level nature of this profile, this type of information only provides a glimpse at the real behavior of the communication system. The need for more information resulted in the development of the MPI-Peruse system. Peruse provides a more detailed profile by enabling applications to gain access to state-changes within the MPI-library. This is achieved by registering callbacks with Peruse, and then invoking them as triggers as message events take place.''Recent Advances in Parallel Virtual Machine and Message Passing Interface'' by Bernd Mohr, Jesper Larsson Träff, Joachim Worringen and Jack Dongarra 2006 page 347 Peruse can work with the PARAVER visualization system. PARAVER has two components, a trace component and a visual component for analyze the traces, the statistics related to specific events, etc. ''PARAVER: A Tool to Visualize and Analyze Parallel Code'' by Vincent Pillet et al, Proceedings of the conference on Transputer and Occam Developments, 1995, pages 17–31 PARAVER may use trace formats from other systems, or perform its own tracing. It operates at the task level, thread level, and in a hybrid format. Traces often include so much information that they are often overwhelming. Thus PARAVER summarizes them to allow users to visualize and analyze them.


Performance analysis

When a large scale, often
supercomputer A supercomputer is a type of computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instruc ...
level, parallel system is being developed, it is essential to be able to experiment with multiple configurations and simulate performance. There are a number of approaches to modeling message passing efficiency in this scenario, ranging from analytical models to trace-based simulation and some approaches rely on the use of test environments based on "artificial communications" to perform ''synthetic tests'' of message passing performance. Systems such as BIGSIM provide these facilities by allowing the simulation of performance on various node topologies, message passing and scheduling strategies.''Petascale Computing: Algorithms and Applications'' by David A. Bader 2007 pages 435–435


Analytical approaches

At the analytical level, it is necessary to model the communication time T in term of a set of subcomponents such as the startup latency, the
asymptotic bandwidth Network throughput (or just throughput, when in context) refers to the rate of message delivery over a communication channel in a communication network, such as Ethernet or packet radio. The data that these messages contain may be delivered ove ...
and the number of processors. A well known model is Hockney's model which simply relies on point to point communication, using T = L + (M / R) where M is the message size, L is the startup latency and R is the asymptotic bandwidth in MB/s.''Modeling Message Passing Overhead'' by C.Y Chou et al. in Advances in Grid and Pervasive Computing: First International Conference, GPC 2006 edited by Yeh-Ching Chung and José E. Moreira pages 299–307 Xu and Hwang generalized Hockney's model to include the number of processors, so that both the latency and the asymptotic bandwidth are functions of the number of processors.''High-Performance Computing and Networking'' edited by Peter Sloot, Marian Bubak and Bob Hertzberge 1998 page 935 Gunawan and Cai then generalized this further by introducing cache size, and separated the messages based on their sizes, obtaining two separate models, one for messages below cache size, and one for those above.


Performance simulation

Specific tools may be used to simulate and understand the performance of message passing on computer clusters. For instance, CLUSTERSIM uses a Java-based visual environment for
discrete-event simulation A discrete-event simulation (DES) models the operation of a system as a ( discrete) sequence of events in time. Each event occurs at a particular instant in time and marks a change of state in the system. Between consecutive events, no change in t ...
. In this approach computed nodes and network topology is visually modeled. Jobs and their duration and complexity are represented with specific
probability distributions In probability theory and statistics, a probability distribution is a function that gives the probabilities of occurrence of possible events for an experiment. It is a mathematical description of a random phenomenon in terms of its sample spac ...
allowing various parallel
job scheduling A job scheduler is a computer application for controlling unattended background program execution of job (computing), jobs. This is commonly called batch scheduling, as execution of non-interactive jobs is often called batch processing, though tr ...
algorithms to be proposed and experimented with. The communication overhead for
MPI MPI or Mpi may refer to: Science and technology Biology and medicine * Magnetic particle imaging, a tomographic technique * Myocardial perfusion imaging, a medical procedure that illustrates heart function * Mannose phosphate isomerase, an enzyme ...
message passing can thus be simulated and better understood in the context of large-scale parallel job execution.''High Performance Computational Science and Engineering'' edited by Michael K. Ng, Andrei Doncescu, Laurence T. Yang and Tau Leng, 2005 pages 59–63 Other simulation tools include MPI-sim and BIGSIM.''Advances in Computer Science, Environment, Ecoinformatics, and Education'' edited by Song Lin and Xiong Huang 2011 page 16 MPI-Sim is an execution-driven simulator that requires C or C++ programs to operate. ClusterSim, on the other hand uses a hybrid higher-level modeling system independent of the programming language used for program execution. Unlike MPI-Sim, BIGSIM is a trace-driven system that simulates based on the logs of executions saved in files by a separate emulator program.''Languages and Compilers for Parallel Computing'' edited by Keith Cooper, John Mellor-Crummey and Vivek Sarkar 2011 pages 202–203 BIGSIM includes an emulator, and a simulator. The emulator executes applications on a small number of nodes and stores the results, so the simulator can use them and simulate activities on a much larger number of nodes. The emulator stores information of sequential execution blocks (SEBs) for multiple processors in log files, with each SEB recording the messages sent, their sources and destinations, dependencies, timings, etc. The simulator reads the log files and simulates them, and may star additional messages which are then also stored as SEBs. The simulator can thus provide a view of the performance of very large applications, based on the execution traces provided by the emulator on a much smaller number of nodes, before the entire machine is available, or configured.


See also

*
High-availability cluster In computing, high-availability clusters (HA clusters) or fail-over clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability sof ...
* Performance simulation *
Supercomputing A supercomputer is a type of computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instruc ...


References

{{Parallel Computing Cluster computing Supercomputing Inter-process communication