Cray XMT (''Cray eXtreme MultiThreading'', codenamed ''Eldorado'') is a scalable multithreaded shared memory supercomputer architecture by Cray, based on the third generation of the Tera MTA architecture, targeted at large graph problems (e.g. semantic databases, big data, pattern matching).
[Maltby, James (2012). ''Cray XMT Multithreated programming model''. "Using the next-generation Cray XMT (uRiKA) for Large Scale Data Analytics." Swiss National Supercomputing Centre.]
Presented in 2005, it supersedes the earlier unsuccessful
Cray MTA-2. It uses Threadstorm3 CPUs inside Cray XT3 blades. Designed to make use of commodity parts and existing subsystems from other commercial systems, it alleviated the main shortcoming of the Cray MTA-2: the high cost of fully custom manufacture and support.
It brought substantial improvements over the Cray MTA-2, most notably nearly tripling the peak performance and vastly increasing the maximum CPU count to 8,192 and the maximum memory to 128 TB, with a data TLB covering at most 512 TB.
Cray XMT uses a scrambled content-addressable memory model on DDR1 ECC modules to implicitly load-balance memory access across the whole shared global address space of the system.
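The load-balancing effect of address scrambling can be illustrated with a small sketch. The scramble function actually used by the XMT is not given here, so the code below substitutes a generic 64-bit multiplicative hash purely as a stand-in; `NUM_BANKS`, `naive_bank`, `scrambled_bank` and `bank_histogram` are hypothetical names chosen for illustration.

```python
from collections import Counter

NUM_BANKS = 64            # hypothetical number of memory banks (assumption)
WORD_BYTES = 8            # 64-bit memory words, as described in the text
MASK64 = (1 << 64) - 1


def naive_bank(addr: int) -> int:
    """Plain interleaving: consecutive words land on consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS


def scrambled_bank(addr: int) -> int:
    """Stand-in scramble (NOT Cray's function): multiplicative hash of the
    word index, keeping the top 6 bits as the bank number (0..63)."""
    word = addr // WORD_BYTES
    mixed = (word * 0x9E3779B97F4A7C15) & MASK64
    return mixed >> 58


def bank_histogram(bank_of, stride_words: int, accesses: int) -> Counter:
    """Count how many of `accesses` strided word accesses hit each bank."""
    return Counter(
        bank_of(i * stride_words * WORD_BYTES) for i in range(accesses)
    )


if __name__ == "__main__":
    # A stride equal to the bank count is the worst case for plain interleaving.
    naive = bank_histogram(naive_bank, stride_words=64, accesses=4096)
    scrambled = bank_histogram(scrambled_bank, stride_words=64, accesses=4096)
    print("naive:     banks used =", len(naive), "max load =", max(naive.values()))
    print("scrambled: banks used =", len(scrambled), "max load =", max(scrambled.values()))
```

With plain interleaving, every access in this strided pattern lands on the same bank (a hotspot), whereas the scrambled mapping spreads the same accesses over many banks without the programmer laying out data by hand.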
Use of 4 additional Extended Memory Semantics bits (''full/empty'', ''forwarding'' and 2 ''trap'' bits) per 64-bit memory word enables lightweight, fine-grained synchronization on all memory.
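The ''full/empty'' bit can be pictured as a one-word handoff channel: a write waits until the word is empty and marks it full, and a read waits until the word is full and marks it empty. The following is a software model only, built on Python's `threading.Condition`; on the XMT this is done in hardware per memory word. The class and method names are hypothetical, and the ''forwarding'' and ''trap'' bits are not modelled.

```python
import threading


class TaggedWord:
    """Software model of one 64-bit word with a full/empty bit (illustrative)."""

    def __init__(self) -> None:
        self._value = 0
        self._full = False                    # the full/empty bit
        self._cond = threading.Condition()

    def write_when_empty(self, value: int) -> None:
        """Block until the word is empty, store the value, mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value & 0xFFFFFFFFFFFFFFFF
            self._full = True
            self._cond.notify_all()

    def read_when_full(self) -> int:
        """Block until the word is full, take the value, mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value


def run_demo(n: int = 100) -> list:
    """Pass n values from a producer thread to a consumer through one word."""
    word = TaggedWord()
    received = []

    def producer() -> None:
        for i in range(n):
            word.write_when_empty(i)

    def consumer() -> None:
        for _ in range(n):
            received.append(word.read_when_full())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_demo(10))
```

Because the synchronization state travels with the data word itself, no separate lock object is needed, which is what makes this style of synchronization fine-grained.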
There are no hardware interrupts and hardware threads are allocated by an instruction, not the OS.
Front-end (login, I/O, and other service nodes, utilizing AMD Opteron processors and running SLES Linux) and back-end (compute nodes, utilizing Threadstorm3 processors and running MTK, a simple BSD Unix-based microkernel) communicate through the LUC (Lightweight User Communication) interface, an RPC-style bidirectional client/server interface.
Threadstorm3
Threadstorm3 (referred to as "MT processor" and ''Threadstorm'' before XMT2) is a 64-bit single-core VLIW barrel processor (compatible with the 940-pin Socket 940 used by AMD Opteron processors) with 128 hardware ''streams'', onto each of which a software thread can be mapped (effectively creating 128 hardware threads per CPU), running at 500 MHz and using the MTA instruction set or a superset of it.
[The Tera MTA ISA is closed-source; the ISA used on Threadstorm CPUs is known not to be a subset of the MTA ISA only because a workshop presentation asserted backward compatibility with previous MTA systems.] It has a 128 KB, 4-way associative data buffer. Each Threadstorm3 has 128 separate register sets and program counters (one per stream), which are fairly fully context-switched at each cycle.
Its estimated peak performance is 1.5 GFLOPS. It has 3 functional units (memory, fused multiply-add and control), which receive operations from the same MTA instruction and operate within the same cycle.
Each stream has 32 general-purpose registers, 8 target registers and a status word, containing the program counter.
High-level control of job allocation across threads is not possible.
[Though it is not known whether it is possible at the instruction level.] Due to the MTA's pipeline length of 21, each stream is selected to execute instructions again no sooner than 21 cycles later. The TDP of the processor package is 30 W.
Due to their thread-level context switch at each cycle, performance of Threadstorm CPUs is not constrained by memory access time. In a simplified model, at each clock cycle an instruction from one of the threads is executed and another memory request is queued with the understanding that by the time the next round of execution is ready the requested data has arrived. This is contrary to many conventional architectures which stall on memory access. The architecture excels in data walking schemes where subsequent memory access cannot be easily predicted and thus wouldn't be well suited to a conventional cache model.
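The simplified model in the preceding paragraph can be sketched as a toy simulation. It assumes that every instruction is a memory operation, that a stream becomes ready again only after both the 21-cycle pipeline and the memory request have completed, and that the processor issues from one ready stream per cycle in round-robin order. The memory latencies used are hypothetical values for illustration, not measured XMT figures, and `utilization` is a hypothetical helper.

```python
def utilization(streams: int, mem_latency: int,
                pipeline_depth: int = 21, cycles: int = 20_000) -> float:
    """Fraction of cycles in which some stream issues an instruction."""
    wait = max(pipeline_depth, mem_latency)   # cycles until a stream may reissue
    ready_at = [0] * streams                  # first cycle each stream is ready
    issued = 0
    start = 0                                 # round-robin scan position
    for cycle in range(cycles):
        for k in range(streams):
            s = (start + k) % streams
            if ready_at[s] <= cycle:
                ready_at[s] = cycle + wait    # stream now waits for its data
                issued += 1
                start = (s + 1) % streams
                break                         # at most one issue per cycle
    return issued / cycles


if __name__ == "__main__":
    for n, latency in [(1, 1), (21, 21), (64, 128), (128, 100)]:
        print(f"streams={n:3d} latency={latency:3d} "
              f"utilization={utilization(n, latency):.3f}")
```

In this model the utilization is roughly min(1, streams / max(21, latency)): a single stream keeps the pipeline only about 1/21 busy, while 128 streams fully hide a latency of up to 128 cycles.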
Threadstorm's principal architect was Burton J. Smith.
Cray XMT2
Cray XMT2 (also "next-generation XMT" or simply ''XMT'') is a scalable multithreaded shared memory supercomputer by Cray, based on the fourth generation of the Tera MTA architecture.
Presented in 2011, it supersedes Cray XMT, which had issues with memory hotspots.
It uses Threadstorm4 CPUs inside Cray XT5 blades and increases memory capacity eightfold to 512 TB and memory bandwidth threefold (300 MHz instead of 200 MHz) compared to XMT by using twice the memory modules per node and DDR2.
It introduces the Node Pair Link inter-Threadstorm connect, as well as memory-only nodes, in which the Threadstorm4 packages have their CPU and HyperTransport 1.x components disabled.
The underlying scrambled content-addressable memory model has been inherited from XMT. XMT2 uses 2 additional EMS bits (''full/empty'' and ''extended'') instead of 4 as in XMT.
Threadstorm4
Threadstorm4 (also "Threadstorm IV" and "Threadstorm 4.0" [On physical package.]) is a 64-bit single-core VLIW barrel processor (compatible with the 1207-pin Socket F used by AMD Opteron processors) with 128 hardware streams, very similar to its predecessor, Threadstorm3. It features an improved, DDR2-capable memory controller and 8 additional ''trap'' registers per stream. Cray intentionally decided against a DDR3 controller, citing the reuse of existing Cray XT5 infrastructure [Even though the DDR3-based Cray XT6 was launched in 2009, two years prior to XMT2.] and the shorter burst length of DDR2.
[As Cray XMT mostly operates with single 8-byte word random accesses and has a 128-bit memory channel, at the DDR2 burst length of 4, the usual overhead is 56 bytes. DDR3 with its burst length of 8 would increase the usual overhead to 120 bytes.] Though the longer burst length could be compensated for by the higher speeds of DDR3, it would also require more power, which Cray engineers wanted to avoid.
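The overhead figures in the footnote follow directly from the channel width and burst length. A short worked calculation, using a hypothetical helper `burst_overhead_bytes` and assuming each random access needs exactly one 8-byte word out of the burst:

```python
def burst_overhead_bytes(channel_bits: int, burst_length: int,
                         useful_bytes: int = 8) -> int:
    """Bytes transferred per burst that the requesting access does not need."""
    transferred = (channel_bits // 8) * burst_length   # bytes moved per burst
    return transferred - useful_bytes


if __name__ == "__main__":
    for name, burst in [("DDR2", 4), ("DDR3", 8)]:
        total = (128 // 8) * burst
        waste = burst_overhead_bytes(128, burst)
        print(f"{name}: {total} bytes per burst, {waste} bytes overhead, "
              f"{8 / total:.2%} useful")
```

A 128-bit channel moves 16 bytes per transfer, so a burst of 4 moves 64 bytes (56 unused) and a burst of 8 moves 128 bytes (120 unused).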
Scorpio
After launching XMT, Cray researched a possible multi-core variant of the Threadstorm3, dubbed ''Scorpio''. Most of Threadstorm3's features would be retained, including the multiplexing of many hardware streams onto an execution pipeline and the implementation of additional state bits for every 64-bit memory word. Cray later abandoned Scorpio, and the project yielded no manufactured chip.
Future
Development of Threadstorm4, as well as of the whole MTA architecture, ended quietly after XMT2, probably due to competition from commodity processors such as Intel's Xeon and possibly Xeon Phi, even though Cray never officially discontinued either XMT or XMT2. As of 2020, Cray has removed all customer documentation on both XMT and XMT2 from its online catalogue.
Users
Cray XMT2 was bought by several federal laboratories and academic facilities, as well as some commercial HPC clients, e.g. CSCS (2 TB global memory with 64 Threadstorm4 CPUs) and Noblis CAHPC.
Most XMT- and XMT2-based systems had been decommissioned by 2020.