Fermi (microarchitecture)

Nvidia Fermi
The NVIDIA GeForce GTX 590, of the GeForce 500 series of graphics cards, was the final major iteration featuring the Fermi microarchitecture (GF110-351-A1).
Release date: April 2010
Manufactured by TSMC
Designed by Nvidia
Fabrication process: 40 nm and 28 nm [citation needed]
History
Predecessor Tesla
Successor Kepler
Support status
Unsupported
Photo of Enrico Fermi, eponym of the architecture

Fermi is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and 500 series. All desktop Fermi GPUs were manufactured in 40 nm; mobile Fermi GPUs were manufactured in 40 nm and 28 nm [citation needed]. Fermi is the oldest microarchitecture from Nvidia that received support for Microsoft's rendering API Direct3D 12 feature level 11.

Fermi was followed by Kepler, and was used alongside Kepler in the GeForce 600 series, GeForce 700 series, and GeForce 800 series; in the latter two, only in mobile GPUs.

In the workstation market, Fermi found use in the Quadro x000 series, Quadro NVS models, and in Nvidia Tesla computing modules.

The architecture is named after Enrico Fermi, an Italian physicist.

Overview

NVIDIA GeForce GTX 480, of the GeForce 400 series of graphics cards; the first iteration to feature the Fermi microarchitecture (GF100-375-A3).
Fig. 1. NVIDIA Fermi architecture. Convention in figures: orange: scheduling and dispatch; green: execution; light blue: registers and caches.
Die shot of the GF100 GPU found inside GeForce GTX 470 cards

Fermi graphics processing units (GPUs) feature 3.0 billion transistors; a schematic is sketched in Fig. 1.

Streaming multiprocessor

Each SM features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units (SFUs), a 64 KB block of high-speed on-chip memory (see the L1+Shared Memory subsection), and an interface to the L2 cache (see the L2 Cache subsection).

Load/Store Units

The load/store units allow source and destination addresses to be calculated for 16 threads per clock, and load and store data from/to cache or DRAM.

Special Functions Units (SFUs)

Execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
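
In CUDA, the fast-math intrinsics map to SFU instructions. A minimal device-code sketch (the kernel name and formula are illustrative, not taken from the source):

```cuda
// Illustrative kernel: __sinf() and __expf() compile to SFU instructions,
// while the standard sinf()/expf() calls expand to longer, more accurate
// software sequences executed on the CUDA cores.
__global__ void wave(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = 0.001f * i;
        out[i] = __sinf(x) * __expf(-x);  // two SFU-backed transcendentals
    }
}
```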

CUDA core

Integer Arithmetic Logic Unit (ALU)

Supports full 32-bit precision for all instructions, consistent with standard programming language requirements.[which?] It is also optimized to efficiently support 64-bit operations, though this capability is artificially limited on consumer versions and fully enabled only on workstation and server models.

Floating Point Unit (FPU)

Implements the IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single- and double-precision arithmetic. Up to 16 double-precision fused multiply-add operations can be performed per SM, per clock. [2]

Fused multiply-add

Fused multiply-add (FMA) performs multiplication and addition (i.e., A×B+C) with a single final rounding step, with no loss of precision in the addition; it is therefore more accurate than performing the two operations separately.
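
A minimal CUDA sketch of the difference (kernel and variable names are illustrative); the unfused form is expressed with the __fmul_rn/__fadd_rn intrinsics, which the compiler never contracts into an FMA:

```cuda
// Illustrative comparison of fused vs. unfused multiply-add.
// fmaf(a, b, c) computes a*b + c with one rounding at the end;
// __fmul_rn/__fadd_rn force two separate roundings.
__global__ void fma_demo(const float *a, const float *b, const float *c,
                         float *fused, float *unfused, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fused[i]   = fmaf(a[i], b[i], c[i]);                  // one rounding
        unfused[i] = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);  // two roundings
    }
}
```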

Warp scheduling

The Fermi architecture uses a two-level, distributed thread scheduler.

Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic Fig. 1. For example, the SM can mix 16 operations from the 16 cores of the first column with 16 operations from the 16 cores of the second column, or 16 operations from the load/store units with four from the SFUs, or any other combination the program specifies.

64-bit floating-point operations require both of the first two execution columns, and therefore run at half the rate of 32-bit operations.

Dual Warp Scheduler

At the SM level, each warp scheduler distributes warps of 32 threads to its execution units. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. The dual warp scheduler selects two warps and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or four SFUs. Most instructions can be dual-issued: two integer instructions, two floating-point instructions, or a mix of integer, floating-point, load, store, and SFU instructions can be issued concurrently. Double-precision instructions do not support dual dispatch with any other operation.[citation needed]

Performance

The theoretical single-precision processing power of a Fermi GPU in GFLOPS is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × shader clock speed (in GHz). Note that the previous-generation Tesla could dual-issue MAD+MUL to CUDA cores and SFUs in parallel, but Fermi lost this ability, as it can only issue 32 instructions per cycle per SM, which keeps just its 32 CUDA cores fully utilized. [3] Therefore, it is not possible to leverage the SFUs to reach more than two operations per CUDA core per cycle.
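
As a worked example of the formula above, using the GeForce GTX 480's published shipping configuration (480 enabled CUDA cores at a 1401 MHz shader clock), a small host-side sketch:

```cuda
// Worked example of the GFLOPS formula above; the figures are the
// GTX 480's published specifications.
#include <cstdio>

int main()
{
    const int    ops_per_fma = 2;      // one multiply + one add per FMA
    const int    cuda_cores  = 480;    // GF100 as shipped on the GTX 480
    const double shader_ghz  = 1.401;  // shader (hot) clock in GHz

    double gflops = ops_per_fma * cuda_cores * shader_ghz;
    std::printf("Theoretical FP32: %.0f GFLOPS\n", gflops);  // ~1345 GFLOPS
    return 0;
}
```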

The theoretical double-precision processing power of a Fermi GPU is 1/2 of the single precision performance on GF100/110. However, in practice this double-precision power is only available on professional Quadro and Tesla cards, while consumer GeForce cards are capped to 1/8. [4]

Memory

Fermi has an L1 cache per SM and a unified L2 cache that services all operations (load, store, and texture).

Registers

Each SM has 32 K (32,768) 32-bit registers. Each thread has access to its own registers and not those of other threads. The maximum number of registers that can be used by a CUDA kernel is 63 per thread. The number of available registers per thread degrades gracefully from 63 to 21 as the workload (and hence the resource requirements) increases with the number of threads. Registers have a very high bandwidth: about 8,000 GB/s.
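
In CUDA, the per-thread register budget can be influenced at compile time. A minimal sketch with an illustrative kernel, using the standard __launch_bounds__ qualifier (nvcc's -maxrregcount flag is the coarser, whole-compilation alternative):

```cuda
// Illustrative kernel: __launch_bounds__ tells the compiler the maximum
// block size (and, optionally, the minimum number of resident blocks per
// SM) so it can budget registers per thread accordingly.
__global__ void __launch_bounds__(256, 4)  // <= 256 threads/block, >= 4 blocks/SM
scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;  // trivial body; register pressure is low here
}
```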

L1+Shared Memory

On-chip memory that can be used either to cache data for individual threads (register spilling/L1 cache) and/or to share data among several threads (shared memory). This 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. Shared memory enables threads within the same thread block to cooperate, facilitates extensive reuse of on-chip data, and greatly reduces off-chip traffic; it is accessible only by threads in the same thread block. It provides low-latency access (10-20 cycles) and very high bandwidth (1,600 GB/s) to moderate amounts of data (such as intermediate results in a series of calculations, one row or column of data for matrix operations, a line of video, etc.). David Patterson notes that this shared memory uses the idea of a local scratchpad. [5]
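
The split is selectable per kernel through the CUDA runtime's cudaFuncSetCacheConfig call. A minimal sketch with an illustrative kernel name:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel using shared memory as a cooperative, block-local tile.
__global__ void copy_via_tile(const float *in, float *out)
{
    __shared__ float tile[256];          // lives in the 64 KB on-chip block
    tile[threadIdx.x] = in[threadIdx.x];
    __syncthreads();                     // make the tile visible block-wide
    out[threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    float *in, *out;
    cudaMalloc(&in,  256 * sizeof(float));
    cudaMalloc(&out, 256 * sizeof(float));

    // Request 48 KB shared memory / 16 KB L1 cache for this kernel;
    // cudaFuncCachePreferL1 would request the 16 KB / 48 KB split instead.
    cudaFuncSetCacheConfig(copy_via_tile, cudaFuncCachePreferShared);

    copy_via_tile<<<1, 256>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```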

Local Memory

Local memory is a memory location used to hold "spilled" registers. Register spilling occurs when a thread block requires more register storage than is available on an SM. Local memory is used only for some automatic variables (those declared in device code without any of the __device__, __shared__, or __constant__ qualifiers). Generally, an automatic variable resides in a register, except in the following cases: (1) arrays that the compiler cannot determine to be indexed with constant quantities; (2) large structures or arrays that would consume too much register space; and (3) any variable the compiler decides to spill to local memory when a kernel uses more registers than are available on the SM.
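
A sketch of case (1), with illustrative names: because the array index depends on runtime data, the compiler places the array in local memory rather than registers. Compiling with nvcc -Xptxas -v reports the resulting per-thread local-memory and spill usage:

```cuda
// Illustrative kernel: the automatic array is dynamically indexed, so the
// compiler cannot resolve the accesses to fixed registers and places the
// array in local memory instead.
__global__ void bin_lookup(const int *keys, float *out, int n)
{
    float table[64];                 // automatic array, dynamically indexed
    for (int b = 0; b < 64; ++b)
        table[b] = 1.0f / (1 + b);   // fill with per-thread values

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[keys[i] & 63];  // runtime index -> local memory
}
```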

L2 Cache

The 768 KB unified L2 cache, shared among the 16 SMs, services all loads and stores from/to global memory, including copies to/from the CPU host, as well as texture requests. The L2 cache subsystem also implements atomic operations, used for managing access to data that must be shared across thread blocks or even kernels.
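
For example, a global atomic lets thread blocks that never synchronize with one another safely update a shared total. A minimal sketch with illustrative names:

```cuda
// Illustrative kernel: atomicAdd() on a global counter is resolved in the
// L2 cache subsystem, so updates are safe across all thread blocks of the
// grid (and even across concurrently running kernels).
__global__ void count_positive(const float *x, int *total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] > 0.0f)
        atomicAdd(total, 1);  // serialized per address, visible grid-wide
}
```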

Global memory

Global memory (VRAM) is accessible by all threads directly, as well as by the host system over the PCIe bus. It has a high latency of 400-800 cycles.[citation needed]
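
A minimal host-side sketch of the two access paths (illustrative names): kernels address VRAM directly, while the host reaches it through explicit copies over the PCIe bus:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative host program: global memory (VRAM) is allocated on the
// device and reached from the host via explicit copies over PCIe.
int main()
{
    const int n = 1024;
    float host[n], *dev;
    for (int i = 0; i < n; ++i) host[i] = float(i);

    cudaMalloc(&dev, n * sizeof(float));                              // VRAM
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice); // PCIe
    // ... launch kernels that read and write dev directly ...
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost); // PCIe
    cudaFree(dev);

    std::printf("host[42] = %f\n", host[42]);
    return 0;
}
```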

Video decompression/compression

See Nvidia NVDEC (formerly called NVCUVID) as well as Nvidia PureVideo.

Nvidia's NVENC video encoding technology was not yet available; it was introduced in the successor, Kepler.


References

  1. "NVIDIA GPU Decoder Device Information".
  2. "NVIDIA's Next Generation CUDA Compute Architecture: Fermi" (PDF). 2009. Retrieved December 7, 2015.
  3. Glaskowsky, Peter N. (September 2009). "NVIDIA's Fermi: The First Complete GPU Computing Architecture" (PDF). p. 22. Retrieved December 6, 2015. A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM
  4. Smith, Ryan (March 26, 2010). "NVIDIA's GeForce GTX 480 and GTX 470: 6 Months Late, Was It Worth the Wait?". AnandTech . p. 6. Retrieved December 6, 2015. the GTX 400 series' FP64 performance is capped at 1/8th (12.5%) of its FP32 performance, as opposed to what the hardware natively can do of 1/2 (50%) FP32
  5. Patterson, David (September 30, 2009). "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" (PDF). Parallel Computing Research Laboratory & NVIDIA. Retrieved October 3, 2013.
