SX-Aurora TSUBASA

The NEC SX-Aurora TSUBASA is a vector processor of the NEC SX architecture family. Unlike previous SX supercomputers, the SX-Aurora TSUBASA is provided as a PCIe card, termed by NEC a "Vector Engine" (VE). Eight VE cards can be inserted into a vector host (VH), which is typically an x86-64 server running the Linux operating system. The product was announced in a press release on 25 October 2017, and NEC started selling it in February 2018. The product succeeds the SX-ACE.


Hardware

SX-Aurora TSUBASA is a successor to the NEC SX series and SUPER-UX, which are vector computer systems upon which the Earth Simulator supercomputer is based. Its hardware consists of x86 Linux hosts with vector engines (VEs) connected via a PCI Express (PCIe) interconnect. High memory bandwidth (0.75–1.2 TB/s) comes from eight cores and six HBM2 memory modules on a silicon interposer, implemented in the form factor of a PCIe card. Operating system functionality for the VE is offloaded to the VH and handled mainly by user space daemons running the VEOS. Depending on the clock frequency (1.4 or 1.6 GHz), each VE CPU has eight cores and a peak performance of 2.15 or 2.45 TFLOPS in double precision. The processor has the world's first implementation of six HBM2 modules on a silicon interposer, with a total of 24 or 48 GB of high-bandwidth memory. It is integrated in the form factor of a standard full-length, full-height, double-width PCIe card that is hosted by an x86_64 server, the Vector Host (VH). A server can host up to eight VEs, and clusters of VHs can scale to an arbitrary number of nodes.


Product releases

Version 1.0 of the Vector Engine was produced in a 16 nm FinFET process (from TSMC) and released in three SKUs (subsequent versions add an E at the end):


Functional units

Each of the eight SX-Aurora cores has 64 logical vector registers. Each register holds 256 elements of 64 bits, and operations on them are implemented as a mix of pipelining and 32-fold parallel SIMD units. The registers are connected to three fused multiply-add (FMA) floating-point units that can run in parallel, as well as two arithmetic logic units (ALUs) handling fixed-point operations and a divide and square root pipe. Considering only the FMA units and their 32-fold SIMD parallelism, a vector core is capable of 192 double-precision operations per cycle. In "packed" vector operations, where two single-precision values are loaded into the space of one double-precision slot in the vector registers, the vector unit delivers twice as many operations per clock cycle as in double precision. A Scalar Processing Unit (SPU) handles non-vector instructions on each of the cores.
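The 192 operations per cycle and the peak figures quoted in the Hardware section can be checked with a short calculation. The Python sketch below is illustrative only: it assumes that each FMA counts as two floating-point operations (one multiply and one add) and takes the unit counts, core count and clock frequencies from the text. The function and constant names are made up for this example.

```python
# Figures taken from the article text; names are illustrative only.
FMA_UNITS = 3        # FMA units per core that can run in parallel
SIMD_WIDTH = 32      # 32-fold SIMD parallelism
FLOPS_PER_FMA = 2    # assumption: one multiply plus one add per FMA
CORES = 8            # cores per Vector Engine


def flops_per_cycle_per_core(packed=False):
    """Floating-point operations per clock cycle for one vector core."""
    per_cycle = FMA_UNITS * SIMD_WIDTH * FLOPS_PER_FMA
    # Packed mode puts two single-precision values in one 64-bit slot,
    # which doubles the number of operations per cycle.
    return per_cycle * 2 if packed else per_cycle


def peak_tflops(clock_ghz, packed=False):
    """Theoretical peak of one Vector Engine in TFLOPS."""
    gflops = CORES * flops_per_cycle_per_core(packed) * clock_ghz
    return gflops / 1000.0


if __name__ == "__main__":
    for clock in (1.4, 1.6):
        print(f"{clock} GHz: {peak_tflops(clock):.4f} TFLOPS (double precision)")
```

Under these assumptions the result is about 2.15 TFLOPS at 1.4 GHz and about 2.46 TFLOPS at 1.6 GHz. The second value matches the quoted 2.45 TFLOPS if that figure was truncated rather than rounded.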


Memory and caches

The memory of the SX-Aurora TSUBASA processor consists of six second-generation High Bandwidth Memory (HBM2) modules implemented in the same package as the CPU with the help of Chip-on-Wafer-on-Substrate technology. Depending on the processor model, the HBM2 modules are either 4-die or 8-die 3D modules with 4 GB or 8 GB capacity each. The SX-Aurora CPUs thus have either 24 GB or 48 GB of HBM2 memory. The models implemented with the large HBM2 modules have 1.2 TB/s memory bandwidth. The cores of a vector engine share a 16 MB last-level cache (LLC), a write-back cache directly connected to the vector registers and the L2 cache of the SPU. The LLC cache line size is 128 bytes. The priority of data retention in the LLC can to some extent be controlled in software, allowing the programmer to specify which variables or arrays should be retained in cache, a feature comparable to the Advanced Data Buffer (ADB) of the NEC SX-ACE.
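The capacities and cache geometry above follow from simple arithmetic, sketched here in Python. The sketch assumes that "16 MB" means 16 MiB and uses the 256 × 64-bit vector register size from the Functional units section. The function and constant names are illustrative and not part of any NEC tool.

```python
# Figures taken from the article text; names are illustrative only.
HBM2_MODULES = 6
LLC_BYTES = 16 * 1024 * 1024      # assumption: "16 MB" is read as 16 MiB
CACHE_LINE_BYTES = 128
VECTOR_REGISTER_ELEMENTS = 256
ELEMENT_BYTES = 8                 # one 64-bit element


def total_hbm2_gb(gb_per_module):
    """Total HBM2 capacity of one Vector Engine in GB."""
    return HBM2_MODULES * gb_per_module


def llc_cache_lines():
    """Number of 128-byte lines that fit in the shared LLC."""
    return LLC_BYTES // CACHE_LINE_BYTES


def cache_lines_per_vector_register():
    """Cache lines touched when one full vector register is loaded contiguously."""
    register_bytes = VECTOR_REGISTER_ELEMENTS * ELEMENT_BYTES
    return register_bytes // CACHE_LINE_BYTES


if __name__ == "__main__":
    print("HBM2 capacity (GB):", total_hbm2_gb(4), "or", total_hbm2_gb(8))
    print("LLC lines:", llc_cache_lines())
    print("Lines per vector register:", cache_lines_per_vector_register())
```

Six 4 GB modules give 24 GB and six 8 GB modules give 48 GB, as stated in the text. Under the stated assumptions, a full vector register of 2048 bytes spans 16 contiguous cache lines.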


Platforms

NEC is currently selling the SX-Aurora TSUBASA vector engine integrated into five platforms:
* A111-1: a tower PC with one VE card of type 10B
* A101-1: a tower PC with one VE card of type 10CE
* A311-4: a dual socket 1U 19 inch rack-mountable Xeon Scalable server equipped with up to four VE cards of type BE
* A311-8: a dual socket 4U 19 inch rack-mountable Xeon Scalable server with up to eight VE cards of type BE
* A511-64: a 19 inch rack equipped with 64 VEs of type AE. This is the only configuration that is explicitly sold as a supercomputer.

Within a VH node, VEs can communicate with each other through PCIe. Large parallel systems built with SX-Aurora use InfiniBand in a PeerDirect setup as interconnect.

NEC also used to sell the SX-Aurora TSUBASA vector engine integrated into five platforms:
* A100-1: a tower PC with one VE card of type 10C.
* A300-2: a single socket 1U rack-mountable Skylake server equipped with up to two VE cards of type 10B or 10C.
* A300-4: a dual socket 1U rack-mountable Skylake server equipped with up to four VE cards of type 10B or 10C.
* A300-8: a dual socket 4U rack-mountable Skylake server with up to eight VE cards of type 10B or 10C.
* A500-64: a rack equipped with either Intel Xeon Silver 4100 family or Intel Xeon Gold 6100 family CPUs and 32, 48 or 64 VEs of type 10A or 10B.

All types are exclusively air cooled with the exception of the A500 series, which also utilizes water cooling.


Software


Operating system

The operating system of the vector engine (VE) is called "VEOS" and has been offloaded entirely to the host system, the vector host (VH). VEOS consists of kernel modules and user space daemons that:
* manage VE processes and their scheduling on the VE
* manage the virtual memory address spaces of the VE processes
* handle transfers between VH and VE memory with the help of the VE DMA engines
* handle interrupts and exceptions of VE processes, as well as their system calls.

VEOS supports multitasking on the VE, and almost all Linux system calls are supported in the VE libc. Offloading operating system services to the VH shifts OS jitter away from the VE at the expense of increased latencies. All VE operating system related packages are licensed under the GNU General Public License and have been published.


Software development

A Software Development Kit (SDK) is available from NEC for developers and customers. It contains proprietary products and must be purchased from NEC. The SDK contains:
* C, C++ and Fortran compilers that support automatic vectorization and automatic parallelization as well as OpenMP.
* Performance optimization tools: ftraceviewer and veperf.
* Optimized numerical libraries for the VE: BLAS, SBLAS, LAPACK, SCALAPACK, ASL, Heterosolver.

NEC MPI is also a proprietary implementation and conforms to the MPI-3.1 standard specification. Hybrid programs can be created that use the VE as an accelerator for certain host kernel functions by using the VE offloading (VEO) C API. To some extent VE offloading is comparable to OpenCL and CUDA, but it provides a simpler API and allows the kernels to be developed in normal C, C++ or Fortran and to use almost any syscall on the VE. Python bindings to VEO are available.


External links


* News and articles for SX-Aurora Vector Engine
* NEC Aurora Forum
* NEC Aurora Web Forum
* NEC Aurora VEOS
* NEC Aurora Vectorization Training
* Collection of tools and projects