Launched | September 20, 2022 |
---|---|
Designed by | Nvidia |
Manufactured by | |
Fabrication process | TSMC 4N |
Product Series | Server/datacenter |
Specifications | |
L1 cache | 256 KB (per SM) |
L2 cache | 50 MB |
Memory support | HBM3 |
PCIe support | PCI Express 5.0 |
Media Engine | |
Encoder(s) supported | NVENC |
History | |
Predecessor | Ampere |
Variant | Ada Lovelace (consumer and professional) |
Successor | Blackwell |
Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is the datacenter counterpart to Ada Lovelace, which targets consumer and professional products. It belongs to the product line formerly branded as Nvidia Tesla.
Named for computer scientist and United States Navy rear admiral Grace Hopper, the Hopper architecture was leaked in November 2019 and officially revealed in March 2022. It improves upon its predecessors, the Turing and Ampere microarchitectures, featuring a new streaming multiprocessor and a faster memory subsystem.
The Nvidia Hopper H100 GPU is implemented using the TSMC 4N process with 80 billion transistors. It consists of up to 144 streaming multiprocessors. [1] The SXM5 version of the H100 offers better performance than the PCIe version. [2]
The streaming multiprocessors for Hopper improve upon the Turing and Ampere microarchitectures, although the maximum number of concurrent warps per streaming multiprocessor (SM) remains the same as in Ampere, at 64. [3] The Hopper architecture provides a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfer between shared memory and global memory. [4] Under TMA, applications may transfer tensors of up to five dimensions. When writing from shared memory to global memory, elementwise reduction and bitwise operators may be used, avoiding registers and SM instructions while enabling users to write warp-specialized code. TMA is exposed through cuda::memcpy_async. [5]
When parallelizing applications, developers can use thread block clusters. Thread blocks may perform atomics in the shared memory of other thread blocks within their cluster, a facility known as distributed shared memory. Distributed shared memory may be used by an SM simultaneously with L2 cache; when used to communicate data between SMs, this can utilize the combined bandwidth of distributed shared memory and L2. The maximum portable cluster size is 8, although the Nvidia Hopper H100 can support a cluster size of 16 when the cudaFuncAttributeNonPortableClusterSizeAllowed function attribute is set, potentially at the cost of a reduced number of active blocks. [6] With L2 multicasting and distributed shared memory, the bandwidth required for dynamic random-access memory reads and writes is reduced. [7]
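The cluster-size limits above can be illustrated with a small planning helper. This is only a sketch, not Nvidia's API: the function `plan_clusters` and its return fields are hypothetical names, and the rule that the number of blocks must be a multiple of the cluster size is an assumption taken from general CUDA launch behaviour rather than from this article.

```python
# Limits quoted in the text above.
MAX_PORTABLE_CLUSTER_SIZE = 8   # largest cluster size that is portable
MAX_H100_CLUSTER_SIZE = 16      # largest cluster size the H100 can support


def plan_clusters(num_blocks, cluster_size):
    """Hypothetical helper: describe how a 1D grid splits into clusters.

    Assumption (not stated in the article): the grid size must be a
    multiple of the cluster size.
    """
    if not 1 <= cluster_size <= MAX_H100_CLUSTER_SIZE:
        raise ValueError("cluster size must be between 1 and 16")
    if num_blocks % cluster_size != 0:
        raise ValueError("grid size must be a multiple of the cluster size")
    return {
        "clusters": num_blocks // cluster_size,
        # Sizes above 8 require the non-portable opt-in attribute.
        "needs_non_portable_opt_in": cluster_size > MAX_PORTABLE_CLUSTER_SIZE,
    }


if __name__ == "__main__":
    print(plan_clusters(64, 8))
    print(plan_clusters(64, 16))
```

A 64-block grid splits into 8 clusters at the portable size of 8, or 4 clusters at size 16, the latter only with the non-portable opt-in.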
Hopper features improved single-precision floating-point format (FP32) throughput, with twice as many FP32 operations per cycle per SM as its predecessor. Additionally, the Hopper architecture adds support for new instructions, including ones that accelerate the Smith–Waterman algorithm. [6] Like Ampere, Hopper supports TensorFloat-32 (TF32) arithmetic. The mapping pattern for both architectures is identical. [8]
The Nvidia Hopper H100 supports HBM3 and HBM2e memory up to 80 GB; the HBM3 memory system supports 3 TB/s, an increase of 50% over the Nvidia Ampere A100's 2 TB/s. Across the architecture, the L2 cache capacity and bandwidth were increased. [9]
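The bandwidth comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the helper names are illustrative, and the conversion of 1 TB to 1000 GB is an assumption about the units.

```python
# Figures quoted in the text above.
A100_BANDWIDTH_TBPS = 2.0
H100_HBM3_BANDWIDTH_TBPS = 3.0
H100_CAPACITY_GB = 80


def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


def seconds_to_stream(capacity_gb, bandwidth_tbps):
    """Ideal time to read the whole memory once (assumes 1 TB = 1000 GB)."""
    return capacity_gb / (bandwidth_tbps * 1000.0)


if __name__ == "__main__":
    gain = percent_increase(A100_BANDWIDTH_TBPS, H100_HBM3_BANDWIDTH_TBPS)
    sweep = seconds_to_stream(H100_CAPACITY_GB, H100_HBM3_BANDWIDTH_TBPS)
    print(f"bandwidth increase: {gain:.0f}%")
    print(f"ideal full-memory sweep: {sweep * 1000:.1f} ms")
```

At the quoted peak rate, one pass over all 80 GB would take on the order of tens of milliseconds; real workloads will not reach the peak figure.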
Hopper allows CUDA compute kernels to utilize automatic inline compression, including in individual memory allocations, which allows accessing memory at higher bandwidth. This feature does not increase the amount of memory available to the application, because the data (and thus its compressibility) may be changed at any time. The compressor automatically chooses between several compression algorithms. [9]
The Nvidia Hopper H100 increases the capacity of the combined L1 cache, texture cache, and shared memory to 256 KB. Like its predecessors, it combines L1 and texture caches into a unified cache designed to be a coalescing buffer. The attribute cudaFuncAttributePreferredSharedMemoryCarveout may be used to define the carveout of the L1 cache. Hopper introduces enhancements to NVLink through a new generation with faster overall communication bandwidth. [10]
Some CUDA applications may experience interference when performing fence or flush operations due to memory ordering. Because the GPU cannot know which writes are guaranteed to be visible and which are visible only by chance timing, it may wait on unnecessary memory operations, slowing down fence or flush operations. For example, when one kernel performs computations in GPU memory while a parallel kernel communicates with a peer, the local kernel will flush its writes, resulting in slower NVLink or PCIe writes. In the Hopper architecture, the GPU can narrow the set of memory operations that a fence operation has to wait on. [11]
The Hopper architecture math application programming interface (API) exposes functions in the SM such as __viaddmin_s16x2_relu, which performs the per-halfword max(min(a + b, c), 0). In the Smith–Waterman algorithm, __vimax3_s16x2_relu can be used, a three-way max followed by a clamp to zero. [12] Similarly, Hopper speeds up implementations of the Needleman–Wunsch algorithm. [13]
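The per-halfword behaviour can be emulated on the host to see what these functions compute. The following Python sketch is an illustration, not the CUDA implementation: it assumes the first operation is max(min(a + b, c), 0) and the second is a three-way max clamped at zero, applied independently to the two signed 16-bit halves of a 32-bit word. The helper names are hypothetical, and overflow of the intermediate sum is not modelled.

```python
def _to_s16(x):
    """Interpret the low 16 bits of x as a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def unpack_s16x2(word):
    """Split a 32-bit word into (low, high) signed halfwords."""
    return _to_s16(word), _to_s16(word >> 16)


def pack_s16x2(lo, hi):
    """Pack two signed halfwords into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def viaddmin_s16x2_relu(a, b, c):
    """Emulate per-halfword max(min(a + b, c), 0) using Python integers."""
    lanes = zip(unpack_s16x2(a), unpack_s16x2(b), unpack_s16x2(c))
    return pack_s16x2(*[max(min(x + y, z), 0) for x, y, z in lanes])


def vimax3_s16x2_relu(a, b, c):
    """Emulate per-halfword max(a, b, c) clamped at zero."""
    lanes = zip(unpack_s16x2(a), unpack_s16x2(b), unpack_s16x2(c))
    return pack_s16x2(*[max(x, y, z, 0) for x, y, z in lanes])


def sw_cell(diag_plus_score, up_minus_gap, left_minus_gap):
    """One Smith-Waterman cell update for two alignments packed per word."""
    return vimax3_s16x2_relu(diag_plus_score, up_minus_gap, left_minus_gap)


if __name__ == "__main__":
    # Low lane: candidates 5, 1, 2. High lane: candidates -3, -1, -2.
    cell = sw_cell(pack_s16x2(5, -3), pack_s16x2(1, -1), pack_s16x2(2, -2))
    print(unpack_s16x2(cell))
```

The clamp at zero corresponds to the Smith–Waterman rule that a local alignment score never drops below zero, which is why a fused max-and-clamp removes a separate comparison from the inner loop.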
The Hopper architecture utilizes a transformer engine. [14]
The SXM5 form factor H100 has a thermal design power (TDP) of 700 watts. Owing to its asynchronous execution features, the Hopper architecture may attain high degrees of utilization and thus may achieve better performance per watt. [15]
Designed by | Nvidia |
---|---|
Manufactured by | |
Fabrication process | TSMC 4N |
Codename(s) | Grace Hopper |
Specifications | |
Compute | GPU: 132 Hopper SMs CPU: 72 Neoverse V2 cores |
Shader clock rate | 1980 MHz |
Memory support | GPU: 96 GB HBM3 or 144 GB HBM3e CPU: 480 GB LPDDR5X |
The GH200 combines a Hopper-based H200 GPU with a Grace-based 72-core CPU on a single module. The total power draw of the module is up to 1000 W. CPU and GPU are connected via NVLink, which provides memory coherence between CPU and GPU memory. [16]
In November 2019, a well-known Twitter account posted a tweet revealing that the next architecture after Ampere would be called Hopper, named after computer scientist and United States Navy rear admiral Grace Hopper, one of the first programmers of the Harvard Mark I. The account stated that Hopper would be based on a multi-chip module design, which would result in a yield gain with lower wastage. [17]
During the 2022 Nvidia GTC, Nvidia officially announced Hopper. [18] By 2023, during the AI boom, H100s were in great demand. Larry Ellison of Oracle Corporation said that year that at a dinner with Nvidia CEO Jensen Huang, he and Elon Musk of Tesla, Inc. and xAI "were begging" for H100s, "I guess is the best way to describe it. An hour of sushi and begging". [19]