DGX-1
   HOME

TheInfoList



OR:

Nvidia DGX is a line of
Nvidia Nvidia CorporationOfficially written as NVIDIA and stylized in its logo as VIDIA with the lowercase "n" the same height as the uppercase "VIDIA"; formerly stylized as VIDIA with a large italicized lowercase "n" on products from the mid 1990s to ...
-produced servers and workstations which specialize in using
GPGPU General-purpose computing on graphics processing units (GPGPU, or less often GPGP) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditiona ...
to accelerate
deep learning Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. De ...
applications. The typical design of a DGX system is based upon a rackmount chassis with motherboard that carries high performance x86 server CPUs (Typically Intel Xeons, though recently the DGX A100 and DGX Station A100 utilize
AMD EPYC Epyc is a brand of multi-core x86-64 microprocessors designed and sold by AMD, based on the company's Zen microarchitecture. Introduced in June 2017, they are specifically targeted for the server and embedded system markets. Epyc processors shar ...
CPUs). The main component of a DGX system is a set of 4 to 16
Nvidia Tesla Nvidia Tesla was the name of Nvidia's line of products targeted at stream processing or general-purpose graphics processing units (GPGPU), named after pioneering electrical engineer Nikola Tesla. Its products began using GPUs from the G80 ser ...
GPU modules on an independent system board. DGX systems have large
heatsinks All electronic devices and circuitry generate excess heat and thus require thermal management to improve reliability and prevent premature failure. The amount of heat output is equal to the power input, if there are no other energy inte ...
and powerful fans to adequately cool thousands of watts of thermal output. The GPU modules are typically integrated into the system using a version of the SXM socket.


Models


Pascal - Volta


DGX-1

DGX-1 servers feature 8
GPU A graphics processing unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobil ...
s based on the
Pascal Pascal, Pascal's or PASCAL may refer to: People and fictional characters * Pascal (given name), including a list of people with the name * Pascal (surname), including a list of people and fictional characters with the name ** Blaise Pascal, Fren ...
or Volta
daughter card In computing, an expansion card (also called an expansion board, adapter card, peripheral card or accessory card) is a printed circuit board that can be inserted into an electrical connector, or expansion slot (also referred to as a bus s ...
s with 128GB of total
HBM2 High Bandwidth Memory (HBM) is a high-speed computer memory interface for 3D-stacked synchronous dynamic random-access memory (SDRAM) initially from Samsung, AMD and SK Hynix. It is used in conjunction with high-performance graphics accelerators ...
memory, connected by an
NVLink NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was f ...
mesh network A mesh network is a local area network topology in which the infrastructure nodes (i.e. bridges, switches, and other infrastructure devices) connect directly, dynamically and non-hierarchically to as many other nodes as possible and cooperate wit ...
. The DGX-1 was announced on the 6th of April in 2016. All models are based on a dual socket configuration of Intel Xeon E5 CPUs, and are equipped with the following features. * 512 GB of DDR4-2133 * Dual 10Gb networking * 4 x 1.92 TB
SSDs A solid-state drive (SSD) is a solid-state storage device that uses integrated circuit assemblies to store data persistently, typically using flash memory, and functioning as secondary storage in the hierarchy of computer storage. It is a ...
* 3200W of combined power supply capability * 3U Rackmount Chassis The product line is intended to bridge the gap between GPUs and
AI accelerator An AI accelerator is a class of specialized hardware accelerator or computer system designed to accelerate artificial intelligence and machine learning applications, including artificial neural networks and machine vision. Typical applications in ...
s in that the device has specific features specializing it for deep learning workloads. The initial Pascal based DGX-1 delivered 170
teraflop In computing, floating point operations per second (FLOPS, flops or flop/s) is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate meas ...
s of
half precision In computing, half precision (sometimes called FP16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications w ...
processing, while the Volta-based upgrade increased this to 960
teraflop In computing, floating point operations per second (FLOPS, flops or flop/s) is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate meas ...
s. The DGX-1 was first available only with the Pascal based configuration, with the first generation SXM socket. The later revision of the DGX-1 offered support for first generation Volta cards via the SXM-2 socket. Nvidia offered upgrade kits that allowed users with a Pascal based DGX-1 to upgrade to a Volta based DGX-1. * The
Pascal Pascal, Pascal's or PASCAL may refer to: People and fictional characters * Pascal (given name), including a list of people with the name * Pascal (surname), including a list of people and fictional characters with the name ** Blaise Pascal, Fren ...
based DGX-1 has two variants, one with an
Intel Xeon Xeon ( ) is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same arc ...
E5-2698 V3, and one with an E5-2698 V4. Pricing for the variant equipped with an E5-2698 V4 is unavailable, the Pascal based DGX-1 with an E5-2698 V3 was priced at launch at $129,000 * The Volta based DGX-1 is equipped with an E5-2698 V4 and was priced at launch at $149,000.


DGX Station

Designed as a
turnkey A turnkey, a turnkey project, or a turnkey operation (also spelled turn-key) is a type of project that is constructed so that it can be sold to any buyer as a completed product. This is contrasted with build to order, where the constructor builds ...
deskside AI supercomputer, the DGX Station is a
tower A tower is a tall Nonbuilding structure, structure, taller than it is wide, often by a significant factor. Towers are distinguished from guyed mast, masts by their lack of guy-wires and are therefore, along with tall buildings, self-supporting ...
computer that can function completely independently without typical datacenter infrastructure such as cooling, redundant power, or 19 inch racks. The DGX station was first available with the following specifications. * Four Volta-based Tesla V100 accelerators, each with 16 GB of
HBM2 High Bandwidth Memory (HBM) is a high-speed computer memory interface for 3D-stacked synchronous dynamic random-access memory (SDRAM) initially from Samsung, AMD and SK Hynix. It is used in conjunction with high-performance graphics accelerators ...
memory * 480 TFLOPS FP16 * Single
Intel Xeon Xeon ( ) is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same arc ...
E5-2698 v4 * 256 GB DDR4 * 4x 1.92 TB SSDs * Dual 10 Gb Ethernet The DGX station is
water-cooled Cooling tower and water discharge of a nuclear power plant Water cooling is a method of heat removal from components and industrial equipment. Evaporative cooling using water is often more efficient than air cooling. Water is inexpensive and non ...
to better manage the heat of almost 1500W of total system components, this allows it to keep a noise range under 35 dB under load. This, among other features, made this system a compelling purchase for customers without the infrastructure to run
rackmount A 19-inch rack is a standardized frame or enclosure for mounting multiple electronic equipment modules. Each module has a front panel that is wide. The 19 inch dimension includes the edges or "ears" that protrude from each side of the equ ...
DGX systems, which can be loud, output a lot of heat, and take up a large area. This was Nvidia's first venture into bringing
high performance computing High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. Overview HPC integrates systems administration (including network and security knowledge) and parallel programming into a multid ...
deskside, which has since remained a prominent marketing strategy for Nvidia.https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/nvidia-dgx-station-a100-datasheet.pdf


DGX-2

The successor of the Nvidia DGX-1 is the Nvidia DGX-2, which uses sixteen Volta-based V100 32GB (second generation) cards in a single unit. It was announced on the 27th of March in 2018. The DGX-2 delivers 2 Petaflops with 512GB of shared memory for tackling massive datasets and uses NVSwitch for high-bandwidth internal communication. DGX-2 has a total of 512GB of
HBM2 High Bandwidth Memory (HBM) is a high-speed computer memory interface for 3D-stacked synchronous dynamic random-access memory (SDRAM) initially from Samsung, AMD and SK Hynix. It is used in conjunction with high-performance graphics accelerators ...
memory, a total of 1.5TB of
DDR4 Double Data Rate 4 Synchronous Dynamic Random-Access Memory (DDR4 SDRAM) is a type of synchronous dynamic random-access memory with a high bandwidth ("double data rate") interface. Released to the market in 2014, it is a variant of dynamic rando ...
. Also present are eight 100Gb/sec
InfiniBand InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used ...
cards and 30.72 TB of SSD storage, all enclosed within a massive 10U rackmount chassis and drawing up to 10 kW under maximum load. The initial price for the DGX-2 was $399,000. The DGX-2 differs from other DGX models in that it contains two separate GPU daughterboards, each with eight GPUs. These boards are connected by an NVSwitch system that allows for full bandwidth communication across all GPUs in the system, without additional latency between boards. A higher performance variant of the DGX-2, the DGX-2H, was offered as well. The DGX-2H replaced the DGX-2's dual Intel Xeon Platinum 8168's with upgraded dual Intel Xeon Platinum 8174's. This upgrade does not increase core count per system, as both CPUs are 24 cores, nor does it enable any new functions of the system, but it does increase the base frequency of the CPUs from 2.7 GHz to 3.1 GHz.


Ampere


DGX A100 Server

Announced and released on May 14, 2020. The DGX A100 was the 3rd generation of DGX server, including 8
Ampere The ampere (, ; symbol: A), often shortened to amp,SI supports only the use of symbols and deprecates the use of abbreviations for units. is the unit of electric current in the International System of Units (SI). One ampere is equal to elect ...
-based A100 accelerators. Also included is 15TB of
PCIe PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe or PCI-e, is a high-speed serial computer expansion bus standard, designed to replace the older PCI, PCI-X and AGP bus standards. It is the common mo ...
gen 4
NVMe NVM Express (NVMe) or Non-Volatile Memory Host Controller Interface Specification (NVMHCIS) is an open, logical-device interface specification for accessing a computer's non-volatile storage media usually attached via PCI Express (PCIe) bus. The ...
storage, 1 TB of RAM, and eight
Mellanox Mellanox Technologies Ltd. ( he, מלאנוקס טכנולוגיות בע"מ) was an Israeli-American multinational supplier of computer networking products based on InfiniBand and Ethernet technology. Mellanox offered adapters, switches, softwa ...
-powered 200GB/s HDR
InfiniBand InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used ...
ConnectX-6 NICs. The DGX A100 is in a much smaller enclosure than its predecessor, the DGX-2, taking up only 6 Rack units. The DGX A100 also moved to an AMD EYPC 7742 CPU, the first DGX server to not be built with an
Intel Xeon Xeon ( ) is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same arc ...
CPU. The initial price for the DGX A100 Server was $199,000.


DGX Station A100

As the successor to the original DGX Station, the DGX Station A100, aims to fill the same niche as the DGX station in being a quiet, efficient, turnkey cluster-in-a-box solution that can be purchased, leased, or rented by smaller companies or individuals who want to utilize machine learning. It follows many of the design choices of the original DGX station, such as the tower orientation, single socket CPU
mainboard A motherboard (also called mainboard, main circuit board, mb, mboard, backplane board, base board, system board, logic board (only in Apple computers) or mobo) is the main printed circuit board (PCB) in general-purpose computers and other expand ...
, a new refrigerant-based cooling system, and a reduced number of accelerators compared to the corresponding rackmount DGX A100 of the same generation. The price for the DGX Station A100 320G is $149,000 and $99,000 for the 160G model, Nvidia also offers Station rental at ~$9000 USD per month through partners in the US (rentacomputer.com) and Europe (iRent IT Systems) to help reduce the costs of implementing these systems at a small scale. The DGX Station A100 comes with two different configurations of the built in A100. * Four
Ampere The ampere (, ; symbol: A), often shortened to amp,SI supports only the use of symbols and deprecates the use of abbreviations for units. is the unit of electric current in the International System of Units (SI). One ampere is equal to elect ...
-based A100 accelerators, configured with 40GB (HBM) or 80GB (HBM2e) memory,
thus giving a total of 160GB or 320GB resulting either in DGX Station A100 variants 160G or 320G. * 2.5 PFLOPS FP16 * Single 64 Core
AMD EPYC Epyc is a brand of multi-core x86-64 microprocessors designed and sold by AMD, based on the company's Zen microarchitecture. Introduced in June 2017, they are specifically targeted for the server and embedded system markets. Epyc processors shar ...
7742 * 512 GB
DDR4 Double Data Rate 4 Synchronous Dynamic Random-Access Memory (DDR4 SDRAM) is a type of synchronous dynamic random-access memory with a high bandwidth ("double data rate") interface. Released to the market in 2014, it is a variant of dynamic rando ...
* 1 x 1.92 TB
NVMe NVM Express (NVMe) or Non-Volatile Memory Host Controller Interface Specification (NVMHCIS) is an open, logical-device interface specification for accessing a computer's non-volatile storage media usually attached via PCI Express (PCIe) bus. The ...
OS drive * 1 x 7.68 TB U.2 NVMe Drive * Dual port 10Gb Ethernet * Single port 1Gb BMC port


Hopper


DGX H100 Server

Announced March 22, 2022 and planned for release in Q3 2022, The DGX H100 is the 4th generation of DGX servers, built with 8
Hopper Hopper or hoppers may refer to: Places *Hopper, Illinois * Hopper, West Virginia * Hopper, a mountain and valley in the Hunza–Nagar District of Pakistan * Hopper (crater), a crater on Mercury People with the name * Hopper (surname) * Grace H ...
-based H100 accelerators, for a total of 32 PFLOPs of FP8 AI compute and 640GB of HBM3 Memory, an upgrade over the DGX A100s HBM2 memory. This upgrade also increases
VRAM Video random access memory (VRAM) is dedicated computer memory used to store the pixels and other graphics data as a framebuffer to be rendered on a computer monitor. This is often different technology than other computer memory, to facilitate b ...
bandwidth to 3 TB/s. The DGX H100 increases the
rackmount A 19-inch rack is a standardized frame or enclosure for mounting multiple electronic equipment modules. Each module has a front panel that is wide. The 19 inch dimension includes the edges or "ears" that protrude from each side of the equ ...
size to 8U to accommodate the 700W TDP of each H100 SXM card. The DGX H100 also has two 1.92TB SSDs for
Operating System An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs. Time-sharing operating systems schedule tasks for efficient use of the system and may also in ...
storage, and 30.72 TB of
Solid state storage A solid-state drive (SSD) is a solid-state storage device that uses integrated circuit assemblies to store data persistently, typically using flash memory, and functioning as secondary storage in the hierarchy of computer storage. It is a ...
for application data. One more notable addition is the presence of two Nvidia Bluefield 3 DPUs, and the upgrade to 400Gb/s
InfiniBand InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used ...
via
Mellanox Mellanox Technologies Ltd. ( he, מלאנוקס טכנולוגיות בע"מ) was an Israeli-American multinational supplier of computer networking products based on InfiniBand and Ethernet technology. Mellanox offered adapters, switches, softwa ...
ConnectX-7 NICs, double the bandwidth of the DGX A100. The DGX H100 uses new 'Cedar Fever' cards, each with four ConnectX-7 400GB/s controllers, and two cards per system. This gives the DGX H100 3.2Tb/s of fabric bandwidth across Infiniband. The DGX H100 has two currently unspecified 4th generation
Xeon Xeon ( ) is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same arc ...
Scalable CPUs (Codenamed
Sapphire Rapids Sapphire Rapids is a List of Intel codenames, codename for Intel's server (fourth generation Xeon Scalable) and workstation processors based on 7 nm process, Intel 7. Sapphire Rapids was intended as part of the Eagle Stream server platform. In a ...
) and 2 Terabytes of System Memory. No pricing is currently available for the DGX H100.


DGX SuperPod

The DGX Superpod is a high performance turnkey
supercomputer A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second ( FLOPS) instead of million instructions ...
solution provided by
Nvidia Nvidia CorporationOfficially written as NVIDIA and stylized in its logo as VIDIA with the lowercase "n" the same height as the uppercase "VIDIA"; formerly stylized as VIDIA with a large italicized lowercase "n" on products from the mid 1990s to ...
using DGX hardware. This tightly integrated system combines high performance DGX compute nodes with fast storage and high bandwidth networking to provide a unique plug & play solution to extremely high demand machine learning workloads. The Selene Supercomputer, at the
Argonne National Laboratory Argonne National Laboratory is a science and engineering research United States Department of Energy National Labs, national laboratory operated by University of Chicago, UChicago Argonne LLC for the United States Department of Energy. The facil ...
, is one example of a DGX SuperPod based system. Selene, built from 280 DGX A100 nodes, ranked 5th on the
Top500 The TOP500 project ranks and details the 500 most powerful non-distributed computing, distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The first of these ...
list for most powerful supercomputers at the time of its completion, and has continued to remain high in performance. This same integration is available to any customer with minimal effort on their behalf, and the new
Hopper Hopper or hoppers may refer to: Places *Hopper, Illinois * Hopper, West Virginia * Hopper, a mountain and valley in the Hunza–Nagar District of Pakistan * Hopper (crater), a crater on Mercury People with the name * Hopper (surname) * Grace H ...
based SuperPod can scale to 32 DGX H100 nodes, for a total of 256 H100 GPUs and 64
x86 x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introd ...
CPUs. This gives the complete SuperPod a whopping 20TB of HBM3 memory, 70.4 TB/s of bisection bandwidth, and up to 1
ExaFLOP In computing, floating point operations per second (FLOPS, flops or flop/s) is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate meas ...
of FP8 AI compute. These SuperPods can then be further joined to create even larger supercomputers. The upcoming Eos supercomputer, designed, built, and operated by Nvidia, will be constructed of 18 H100 based SuperPods, totaling 576 DGX H100 systems, 500 Quantum-2
InfiniBand InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used ...
switches, and 360
NVLink NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was f ...
Switches, this will allow Eos to deliver 18 EFLOPs of FP8 compute, and 9 EFLOPs of FP16 compute, making Eos the fastest AI supercomputer in the world. As Nvidia does not produce any storage devices or systems, Nvidia SuperPods rely on partners to provide high performance storage. Current storage partners for Nvidia Superpods are
Dell EMC Dell EMC (EMC Corporation until 2016) is an American multinational corporation headquartered in Hopkinton, Massachusetts and Round Rock, Texas, United States. Dell EMC sells data storage, information security, virtualization, analytics, cloud c ...
, DDN, HPE, IBM,
NetApp NetApp, Inc. is an American hybrid cloud data services and data management company headquartered in San Jose, California. It has ranked in the Fortune 500 from 2012–2021. Founded in 1992 with an IPO in 1995, NetApp offers cloud data services ...
, Pavilion Data, and VAST Data.


Accelerators


See also

*
Deep Learning Super Sampling Deep learning super sampling (DLSS) is a family of real-time deep learning image enhancement and upscaling technologies developed by Nvidia that are exclusive to its RTX line of graphics cards, and available in a number of video games. The goal o ...
*
Nvidia Tesla Nvidia Tesla was the name of Nvidia's line of products targeted at stream processing or general-purpose graphics processing units (GPGPU), named after pioneering electrical engineer Nikola Tesla. Its products began using GPUs from the G80 ser ...
*
Supercomputer A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second ( FLOPS) instead of million instructions ...

Page on high performance computing with 4x and 8x A100 per computer node
also showing switch topology dumps.


References

{{Nvidia AI accelerators GPGPU Nvidia products Parallel computing