SHMEM (from Cray Research's "shared memory" library) is a family of parallel programming libraries providing one-sided, RDMA, parallel-processing interfaces for low-latency distributed-memory supercomputers. The SHMEM acronym was subsequently reverse-engineered to mean "Symmetric Hierarchical MEMory". Its use later expanded to distributed-memory parallel computer clusters, where it serves as a parallel programming interface or as a low-level layer on which to build partitioned global address space (PGAS) systems and languages.

"Libsma", the first SHMEM library, was created by Richard Smith at Cray Research in 1993 as a set of thin interfaces to the CRAY T3D's inter-processor-communication hardware. SHMEM has been implemented by Cray Research, SGI, Cray Inc., Quadrics, HP, IBM, QLogic, and Mellanox, as well as by the Universities of Houston and Florida (e.g. GSHMEM); there is also the open-source OpenSHMEM. SHMEM laid the foundations for low-latency (sub-microsecond) one-sided communication. After its use on the CRAY T3E, its popularity waned, as few machines could deliver the near-microsecond latencies needed to keep its hallmark single-word communication efficient. With the advent of popular sub-microsecond interconnects, SHMEM has been used to meet the need for highly efficient, portable parallel-communication methods for exascale computing.

Programs written using SHMEM are started on several computers connected by a high-performance network supported by the SHMEM library in use. Every computer runs a copy of the program (
SPMD); each copy is called a PE (processing element). PEs can ask the SHMEM library to perform remote memory-access operations, such as reading ("shmem_get") or writing ("shmem_put") data. These peer-to-peer operations are one-sided, meaning that no active cooperation from the remote thread is needed to complete the action (though it can poll its local memory for changes using "shmem_wait"). Operations can be performed on short types such as bytes or words, or on longer datatypes such as arrays, which may be evenly strided or indexed (only some elements of the array are sent). For short datatypes, SHMEM can perform atomic operations (
CAS (compare-and-swap), fetch-and-add, atomic increment, etc.), even on remote memory. There are also two different synchronization mechanisms: task-control synchronization (barriers and locks) and functions to enforce memory fencing and ordering. SHMEM provides several collective operations, which must be started by all PEs, such as reductions, broadcast, and collect. Every PE declares part of its memory as a "symmetric" segment (or shared memory area); the rest is private. Only "shared" memory can be accessed by one-sided operations from remote PEs. Programmers can use static-memory constructs or the shmalloc/shfree routines to create objects with symmetric addresses that span the PEs.


Typical SHMEM functions

* start_pes(N) - start N processing elements (PEs)
* _my_pe() - ask SHMEM to return the PE identifier of the calling thread
* shmem_barrier_all() - wait until all PEs reach the barrier, then let them continue
* shmem_put(target, source, length, pe) - write data of length "length" from the local address "source" to the remote address "target" on the PE with id "pe"
* shmem_get(target, source, length, pe) - read data of length "length" from the remote address "source" on the PE with id "pe" into the local address "target"


List of SHMEM implementations

* Cray Research: original SHMEM for the CRAY T3D, CRAY T3E, and Cray Research PVP supercomputers
* SGI: SGI-SHMEM for systems with NUMAlink and for Altix systems built with InfiniBand network adapters
* Cray Inc.: MP-SHMEM for Unicos MP (X1E supercomputer)
* Cray Inc.: LC-SHMEM for Unicos LC (Cray XT3, XT4, XT5)
* Quadrics: Q-SHMEM for Linux clusters with the QsNet interconnect
* Cyclops-64 SHMEM
* HP SHMEM
* IBM SHMEM
* GPSHMEM


OpenSHMEM implementations

OpenSHMEM is a standardization effort led by SGI and Open Source Software Solutions, Inc. Implementations include:

* University of Houston: reference OpenSHMEM
* Mellanox ScalableSHMEM
* Portals-SHMEM (on top of the Portals interface)
* University of Florida: Gator SHMEM
* Open MPI, which includes an implementation of OpenSHMEM
* Adapteva Epiphany coprocessor (James Ross and David Richie. "An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor". Proceedings of the Third Workshop on OpenSHMEM and Related Technologies. Springer, 2016.)


Disadvantages

In its first years, SHMEM was available only on certain Cray Research machines (and later also on SGI systems) equipped with special networks (SHMEM // Cray, Document 004-2178-002, chapter 3), which limited the library's adoption and created vendor lock-in: for example, Cray Research recommended partially rewriting MPI programs to combine MPI and shmem calls, making such programs non-portable to pure-MPI environments.

SHMEM was never defined as a standard, so several incompatible variants of SHMEM libraries were created by other vendors. The libraries used different include-file names and different management-function names for starting PEs or getting the current PE id, and some functions were changed or unsupported. Some SHMEM routines were designed around CRAY T3D architectural limitations; for example, reductions and broadcasts could be started only on subsets of PEs whose size is a power of two (Introduction to Parallel Computing - 3.11 Related Work // cse590o course, University of Washington, Winter 2002; page 154).

Variants of SHMEM libraries can run on top of any MPI library, even when a cluster has only non-RDMA-optimized Ethernet, but performance is then typically worse than with enhanced networking protocols. Memory in the shared region must be allocated with special functions (shmalloc/shfree), not with the system malloc. SHMEM is available only for C and Fortran (some versions also for C++) (OpenSHMEM TUTORIAL // University of Houston, Texas, 2012).

Many disadvantages of SHMEM have since been overcome with the use of OpenSHMEM on popular sub-microsecond interconnects, driven by exascale development.


See also

* Message Passing Interface (especially the one-sided operations of MPI-2)
* Active Messages
* Unified Parallel C (a PGAS language that can be implemented on top of SHMEM)


References


Further reading


* Shared Memory Access (SHMEM) Routines // Cray Research, 1995


External links





* (SGI TPL) Introduction to the SHMEM programming model
* OpenSHMEM - an effort to create a specification for a standardized API for parallel programming in the Partitioned Global Address Space