H100 Tensor Core GPU
   HOME

TheInfoList



OR:

Hopper is a
graphics processing unit A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal ...
(GPU)
microarchitecture In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as μarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular ...
developed by
Nvidia Nvidia Corporation ( ) is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware. Founded in 1993 by Jensen Huang (president and CEO), Chris Malachowsky, and Curti ...
. It is designed for datacenters and is used alongside the Lovelace microarchitecture. It is the latest generation of the line of products formerly branded as
Nvidia Tesla Nvidia Tesla is the former name for a line of products developed by Nvidia targeted at stream processing or GPGPU, general-purpose graphics processing units (GPGPU), named after pioneering electrical engineer Nikola Tesla. Its products began us ...
, now Nvidia Data Centre GPUs. Named for computer scientist and
United States Navy The United States Navy (USN) is the naval warfare, maritime military branch, service branch of the United States Department of Defense. It is the world's most powerful navy with the largest Displacement (ship), displacement, at 4.5 millio ...
rear admiral Rear admiral is a flag officer rank used by English-speaking navies. In most European navies, the equivalent rank is called counter admiral. Rear admiral is usually immediately senior to commodore and immediately below vice admiral. It is ...
Grace Hopper Grace Brewster Hopper (; December 9, 1906 – January 1, 1992) was an American computer scientist, mathematician, and United States Navy rear admiral. She was a pioneer of computer programming. Hopper was the first to devise the theory of mach ...
, the Hopper architecture was leaked in November 2019 and officially revealed in March 2022. It improves upon its predecessors, the
Turing Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. He was highly influential in the development of theoretical compute ...
and
Ampere The ampere ( , ; symbol: A), often shortened to amp,SI supports only the use of symbols and deprecates the use of abbreviations for units. is the unit of electric current in the International System of Units (SI). One ampere is equal to 1 c ...
microarchitectures, featuring a new streaming multiprocessor, a faster memory subsystem, and a
transformer In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple Electrical network, circuits. A varying current in any coil of the transformer produces ...
acceleration engine.


Architecture

The Nvidia Hopper H100 GPU is implemented using the
TSMC Taiwan Semiconductor Manufacturing Company Limited (TSMC or Taiwan Semiconductor) is a Taiwanese multinational semiconductor contract manufacturing and design company. It is one of the world's most valuable semiconductor companies, the world' ...
N4 process with 80 billion transistors. It consists of up to 144 streaming multiprocessors. Due to the increased memory bandwidth provided by the SXM5 socket, the Nvidia Hopper H100 offers better performance when used in an SXM5 configuration than in the typical PCIe socket.


Streaming multiprocessor

The streaming multiprocessors for Hopper improve upon the
Turing Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. He was highly influential in the development of theoretical compute ...
and
Ampere The ampere ( , ; symbol: A), often shortened to amp,SI supports only the use of symbols and deprecates the use of abbreviations for units. is the unit of electric current in the International System of Units (SI). One ampere is equal to 1 c ...
microarchitectures, although the maximum number of concurrent warps per streaming multiprocessor (SM) remains the same between the Ampere and Hopper architectures, 64. The Hopper architecture provides a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfer between shared memory and global memory. Under TMA, applications may transfer up to 5D tensors. When writing from shared memory to global memory, elementwise reduction and bitwise operators may be used, avoiding registers and SM instructions while enabling users to write warp specialized codes. TMA is exposed through cuda::memcpy_async. When parallelizing applications, developers can use thread block clusters. Thread blocks may perform atomics in the shared memory of other thread blocks within its cluster, otherwise known as
distributed shared memory In computer science, distributed shared memory (DSM) is a form of memory architecture where physically separated memories can be addressed as a single shared address space. The term "shared" does not mean that there is a single centralized memo ...
. Distributed shared memory may be used by an SM simultaneously with
L2 cache A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which ...
; when used to communicate data between SMs, this can utilize the combined bandwidth of distributed shared memory and L2. The maximum portable cluster size is 8, although the Nvidia Hopper H100 can support a cluster size of 16 by using the cudaFuncAttributeNonPortableClusterSizeAllowed function, potentially at the cost of reduced number of active blocks. With L2 multicasting and distributed shared memory, the required bandwidth for
dynamic random-access memory Dynamics (from Greek language, Greek δυναμικός ''dynamikos'' "powerful", from δύναμις ''dynamis'' "power (disambiguation), power") or dynamic may refer to: Physics and engineering * Dynamics (mechanics), the study of forces and t ...
read and writes is reduced. Hopper features improved
single-precision floating-point format Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. A floa ...
(FP32) throughput with twice as many FP32 operations per cycle per SM than its predecessor. Additionally, the Hopper architecture adds support for new instructions, including the
Smith–Waterman algorithm The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorit ...
. Like Ampere, TensorFloat-32 (TF-32) arithmetic is supported. The mapping pattern for both architectures is identical.


Memory

The Nvidia Hopper H100 supports HBM3 and
HBM2 High Bandwidth Memory (HBM) is a computer memory interface for 3D-stacked synchronous dynamic random-access memory (SDRAM) initially from Samsung, AMD and SK Hynix. It is used in conjunction with high-performance graphics accelerators, network de ...
e memory up to 80 GB; the HBM3 memory system supports 3 TB/s, an increase of 50% over the Nvidia Ampere A100's 2 TB/s. Across the architecture, the L2 cache capacity and bandwidth were increased. Hopper allows
CUDA In computing, CUDA (Compute Unified Device Architecture) is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated gene ...
compute kernel In computing, a compute kernel is a routine compiled for high throughput accelerators (such as graphics processing units (GPUs), digital signal processors (DSPs) or field-programmable gate arrays (FPGAs)), separate from but used by a main pro ...
s to utilize automatic inline compression, including in individual memory allocation, which allows accessing memory at higher bandwidth. This feature does not increase the amount of memory available to the application, because the data (and thus its
compressibility In thermodynamics and fluid mechanics, the compressibility (also known as the coefficient of compressibility or, if the temperature is held constant, the isothermal compressibility) is a measure of the instantaneous relative volume change of a f ...
) may be changed at any time. The compressor will automatically choose between several compression algorithms. The Nvidia Hopper H100 increases the capacity of the combined L1 cache, texture cache, and shared memory to 256 KB. Like its predecessors, it combines L1 and texture caches into a unified cache designed to be a coalescing buffer. The attribute cudaFuncAttributePreferredSharedMemoryCarveout may be used to define the carveout of the L1 cache. Hopper introduces enhancements to
NVLink NVLink is a wire-based serial multi-lane near-range communications protocol, communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central ...
through a new generation with faster overall communication bandwidth.


Memory synchronization domains

Some CUDA applications may experience interference when performing fence or flush operations due to memory ordering. Because the GPU cannot know which writes are guaranteed and which are visible by chance timing, it may wait on unnecessary memory operations, thus slowing down fence or flush operations. For example, when a kernel performs computations in GPU memory and a parallel kernel performs communications with a peer, the local kernel will flush its writes, resulting in slower NVLink or
PCIe PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe, is a high-speed standard used to connect hardware components inside computers. It is designed to replace older expansion bus standards such as Peripher ...
writes. In the Hopper architecture, the GPU can reduce the net cast through a fence operation.


DPX instructions

The Hopper architecture math
application programming interface An application programming interface (API) is a connection between computers or between computer programs. It is a type of software Interface (computing), interface, offering a service to other pieces of software. A document or standard that des ...
(API) exposes functions in the SM such as __viaddmin_s16x2_relu, which performs the per- halfword max(min(a + b, c), 0). In the Smith–Waterman algorithm, __vimax3_s16x2_relu can be used, a three-way min or max followed by a clamp to zero. Similarly, Hopper speeds up implementations of the
Needleman–Wunsch algorithm The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul ...
.


Transformer engine

The Hopper architecture was the first Nvidia architecture to implement the transformer engine. The transformer engine accelerates computations by dynamically reducing them from higher numerical precisions (i.e., FP16) to lower precisions that are faster to perform (i.e., FP8) when the loss in precision is deemed acceptable. The transformer engine is also capable of dynamically allocating bits in the chosen precision to either the mantissa or exponent at runtime to maximize precision.


Power efficiency

The SXM5 form factor H100 has a thermal design power (TDP) of 700
watt The watt (symbol: W) is the unit of Power (physics), power or radiant flux in the International System of Units (SI), equal to 1 joule per second or 1 kg⋅m2⋅s−3. It is used to quantification (science), quantify the rate of Work ...
s. With regards to its asynchrony, the Hopper architecture may attain high degrees of utilization and thus may have a better performance-per-watt.


Grace Hopper

The GH200 combines a Hopper-based H100 GPU with a Grace-based 72-core CPU on a single module. The total power draw of the module is up to 1000 W. CPU and GPU are connected via NVLink, which provides memory coherence between CPU and GPU memory.


History

In November 2019, a well-known
Twitter Twitter, officially known as X since 2023, is an American microblogging and social networking service. It is one of the world's largest social media platforms and one of the most-visited websites. Users can share short text messages, image ...
account posted a tweet revealing that the next architecture after
Ampere The ampere ( , ; symbol: A), often shortened to amp,SI supports only the use of symbols and deprecates the use of abbreviations for units. is the unit of electric current in the International System of Units (SI). One ampere is equal to 1 c ...
would be called Hopper, named after computer scientist and
United States Navy The United States Navy (USN) is the naval warfare, maritime military branch, service branch of the United States Department of Defense. It is the world's most powerful navy with the largest Displacement (ship), displacement, at 4.5 millio ...
rear admiral
Grace Hopper Grace Brewster Hopper (; December 9, 1906 – January 1, 1992) was an American computer scientist, mathematician, and United States Navy rear admiral. She was a pioneer of computer programming. Hopper was the first to devise the theory of mach ...
, one of the first programmers of the
Harvard Mark I The Harvard Mark I, or IBM Automatic Sequence Controlled Calculator (ASCC), was one of the earliest general-purpose electromechanical computers used in the war effort during the last part of World War II. One of the first programs to run on th ...
. The account stated that Hopper would be based on a
multi-chip module A multi-chip module (MCM) is generically an electronic assembly (such as a package with a number of conductor terminals or Lead (electronics), "pins") where multiple integrated circuits (ICs or "chips"), semiconductor Die (integrated circuit), d ...
design, which would result in a yield gain with lower wastage. During the 2022
Nvidia GTC Nvidia GTC (GPU Technology Conference) is a global artificial intelligence (AI) conference for developers that brings together developers, engineers, researchers, inventors, and IT professionals. Topics focus on AI, computer graphics, data science ...
, Nvidia officially announced Hopper. In late 2022, due to US regulations limiting the export of chips to the
People's Republic of China China, officially the People's Republic of China (PRC), is a country in East Asia. With population of China, a population exceeding 1.4 billion, it is the list of countries by population (United Nations), second-most populous country after ...
, Nvidia adapted the H100 chip to the Chinese market with the H800. This model has lower bandwidth compared to the original H100 model. In late 2023, the US government announced new restrictions on the export of AI chips to China, including the A800 and H800 models. This led to Nvidia creating another chip predicated on Hopper microarchitecture: the H20, a modified version of the H100. The H20 had become the most prominent chip in the Chinese market as of 2025.https://qz.com/nvidia-ai-chips-gpus-sell-china-market-export-controls-1851772678/slides/2 By 2023, during the
AI boom The AI boom is an ongoing period of rapid Progress in artificial intelligence, progress in the field of artificial intelligence (AI) that started in the late 2010s before gaining international prominence in the early 2020s. Examples include lar ...
, H100s were in great demand.
Larry Ellison Lawrence Joseph Ellison (born August 17, 1944) is an American businessman and entrepreneur who co-founded software company Oracle Corporation. He was Oracle's chief executive officer from 1977 to 2014 and is now its chief technology officer a ...
of
Oracle Corporation Oracle Corporation is an American Multinational corporation, multinational computer technology company headquartered in Austin, Texas. Co-founded in 1977 in Santa Clara, California, by Larry Ellison, who remains executive chairman, Oracle was ...
said that year that at a dinner with Nvidia CEO
Jensen Huang Jen-Hsun "Jensen" Huang ( zh, t=黃仁勳, poj=N̂g Jîn-hun, hp=Huáng Rénxūn; born February 17, 1963) is a Taiwanese and American businessman, electrical engineer, and philanthropist who is the president, co-founder, and chief executive of ...
, he and
Elon Musk Elon Reeve Musk ( ; born June 28, 1971) is a businessman. He is known for his leadership of Tesla, SpaceX, X (formerly Twitter), and the Department of Government Efficiency (DOGE). Musk has been considered the wealthiest person in th ...
of
Tesla, Inc. Tesla, Inc. ( or ) is an American multinational automotive and clean energy company. Headquartered in Austin, Texas, it designs, manufactures and sells battery electric vehicles (BEVs), stationary battery energy storage devices from h ...
and xAI "were begging" for H100s, "I guess is the best way to describe it. An hour of sushi and begging". In January 2024,
Raymond James Financial Raymond James Financial, Inc. is an American multinational independent investment bank and financial services company providing financial services to individuals, corporations, and municipalities through its subsidiary companies that engage pri ...
analysts estimated that Nvidia was selling the H100 GPU in the price range of $25,000 to $30,000 each, while on eBay, individual H100s cost over $40,000. As of February 2024, Nvidia was reportedly shipping H100 GPUs to data centers in armored cars.


H100 accelerator and DGX H100


References


Citations


Works cited

* * * * *


Further reading

* * * {{Nvidia Nvidia microarchitectures Nvidia Hopper