Coherent Accelerator Processor Interface

Last updated
Coherent Accelerator Processor Interface
Year created2014;9 years ago (2014)
Created by

Coherent Accelerator Processor Interface (CAPI), is a high-speed processor expansion bus standard for use in large data center computers, initially designed to be layered on top of PCI Express, for directly connecting central processing units (CPUs) to external accelerators like graphics processing units (GPUs), ASICs, FPGAs or fast storage. [1] [2] It offers low latency, high speed, direct memory access connectivity between devices of different instruction set architectures.

Contents

History

The performance scaling traditionally associated with Moore's Law—dating back to 1965—began to taper off around 2004, as both Intel's Prescott architecture and IBM's Cell processor pushed toward a 4 GHz operating frequency. Here both projects ran into a thermal scaling wall, whereby heat extraction problems associated with further increases in operating frequency largely outweighed gains from shorter cycle times.

Over the decade that followed, few commercial CPU products exceeded 4 GHz, with the majority of performance improvements now coming from incrementally improved microarchitectures, better systems integration, and higher compute density—this largely in the form of packing a larger numbers of independent cores onto the same die, often at the expense of peak operating frequency (Intel's 24-core Xeon E7-8890 from June 2016 has a base operating frequency of just 2.2 GHz, so as to operate within the constraints of a single-socket 165 W power consumption and cooling budget).

Where large performance gains have been realized, it was often associated with increasingly specialized compute units, such as GPU units added to the processor die, or external GPU- or FPGA-based accelerators. In many applications, accelerators struggle with limitations of the interconnect's performance (bandwidth and latency) or with limitations due to the interconnect's architecture (such as lacking memory coherence). Especially in the datacenter, improving the interconnect became paramount in moving toward a heterogeneous architecture in which hardware becomes increasingly tailored to specific compute workloads.

CAPI was developed to enable computers to more easily and efficiently attach specialized accelerators. Memory intensive and computation intensive works like matrix multiplications for deep neural networks can be offloaded into CAPI-supported platforms. [3] It was designed by IBM for use in its POWER8 based systems which came to market in 2014. At the same time, IBM and several other companies founded the OpenPOWER Foundation to build an ecosystem around Power based technologies, including CAPI. In October 2016 several OpenPOWER partners formed the OpenCAPI Consortium together with GPU and CPU designer AMD and systems designers Dell EMC and Hewlett Packard Enterprise to spread the technology beyond the scope of OpenPOWER and IBM. [4]

On August 1, 2022, OpenCAPI specifications and assets were transferred to the Compute Express Link (CXL) Consortium. [5]

Implementation

CAPI

CAPI is implemented as a functional unit inside the CPU, called the Coherent Accelerator Processor Proxy (CAPP) with a corresponding unit on the accelerator called the Power Service Layer (PSL). The CAPP and PSL units acts like a cache directory so the attached device and the CPU can share the same coherent memory space, and the accelerator becomes an Accelerator Function Unit (AFU), a peer to other functional units integrated in the CPU. [6] [7]

Since the CPU and AFU share the same memory space, low latency and high speeds can be achieved since the CPU doesn't have to do memory translations and memory shuffling between the CPU's main memory and the accelerator's memory spaces. An application can make use of the accelerator without specific device drivers as everything is enabled by a general CAPI kernel extension in the host operating system. The CPU and PSL can read and write directly to each other's memories and registers, as demanded by the application.

CAPI

CAPI is layered on top of PCIe Gen 3, using 16 PCIe lanes, and is an additional functionality for the PCIe slots on CAPI enabled systems. Usually there are designated CAPI enabled PCIe slots on such machines. Since there is only one CAPP per POWER8 processor the number of possible CAPI units are determined by the number of POWER8 processors, regardless of how many PCIe slots there are. In certain POWER8 systems, IBM makes use of dual chip modules, thus doubling the CAPI capacity per processor socket.

Traditional transactions between a PCIe device and a CPU can take around 20,000 operations, whereas a CAPI attached device will only use around 500, significantly reducing latency, and effectively increasing bandwidth due to decreased operations overhead. [7]

The total bandwidth of a CAPI port is determined by the underlying PCIe 3.0 x16 technology, peaking at ca 16 GB/s, bidirectional. [8]

CAPI 2

CAPI-2 is an incremental evolution of the technology introduced with IBM POWER9 processor. [8] It runs on top of PCIe Gen 4 that effectively doubles the performance to 32 GB/s. It also introduces some new features like support for DMA and Atomics from the accelerator.

OpenCAPI

The technology behind OpenCAPI is governed by the OpenCAPI Consortium, founded in October 2016 by AMD, Google, IBM, Mellanox and Micron together with partners Nvidia, Hewlett Packard Enterprise, Dell EMC and Xilinx. [9]

OpenCAPI 3

OpenCAPI, formerly New CAPI or CAPI 3.0, is not layered on top of PCIe and will therefore not use PCIe slots. In IBM's CPU POWER9 it will use the Bluelink 25G I/O facility that it shares with NVLink 2.0, peaking at 50 GB/s. [10] OpenCAPI doesn't need the PSL unit (required for CAPI 1 and 2) in the accelerator, as it's not layered on top of PCIe but uses its own transaction protocol. [11]

OpenCAPI 4

Planned for future chip after the General Availability of POWER9. [12]

OMI

OpenCAPI Memory Interface (OMI) is a serial attached RAM technology based on OpenCAPI, providing low latency, high bandwidth connection for main memory. OMI uses a controller chip on the memory modules that allows for technology agnostic approach to what is used on the modules, be it DDR4, DDR5, HBM or storage class non-volatile RAM. An OMI based CPU can therefore change RAM type by changing the memory modules.

A serial connection uses less floorspace for the interface on the CPU die therefore potentially allowing more of them compared to using common DDR memory.

OMI is implemented in IBM's Power10 CPU, which has 8 OMI memory controllers on-chip, allowing for 4 TB RAM and 410 GB/s memory bandwidth per processor. These DDIMMs (Differential Dynamic Memory Module) includes a OMI controller and memory buffer, and can address individual memory chips for fault tolerance and redundancy purposes.

Microchip Technology manufactures the OMI controller on the DDIMMs. Their SMC 1000 OpenCAPI memory is described as "the next progression in the market adopting serial attached memory." [13]

See also

Legacy

Contemporary

Related Research Articles

HyperTransport (HT), formerly known as Lightning Data Transport, is a technology for interconnection of computer processors. It is a bidirectional serial/parallel high-bandwidth, low-latency point-to-point link that was introduced on April 2, 2001. The HyperTransport Consortium is in charge of promoting and developing HyperTransport technology.

<span class="mw-page-title-main">PCI Express</span> Computer expansion bus standard

PCI Express, officially abbreviated as PCIe or PCI-e, is a high-speed serial computer expansion bus standard, designed to replace the older PCI, PCI-X and AGP bus standards. It is the common motherboard interface for personal computers' graphics cards, sound cards, hard disk drive host adapters, SSDs, Wi-Fi and Ethernet hardware connections. PCIe has numerous improvements over the older standards, including higher maximum system bus throughput, lower I/O pin count and smaller physical footprint, better performance scaling for bus devices, a more detailed error detection and reporting mechanism, and native hot-swap functionality. More recent revisions of the PCIe standard provide hardware support for I/O virtualization.

The PowerPC 400 family is a line of 32-bit embedded RISC processor cores based on the PowerPC or Power ISA instruction set architectures. The cores are designed to fit inside specialized applications ranging from system-on-a-chip (SoC) microcontrollers, network appliances, application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) to set-top boxes, storage devices and supercomputers.

<span class="mw-page-title-main">Multi-chip module</span> Electronic assembly containing multiple integrated circuits that behaves as a unit

A multi-chip module (MCM) is generically an electronic assembly where multiple integrated circuits, semiconductor dies and/or other discrete components are integrated, usually onto a unifying substrate, so that in use it can be treated as if it were a larger IC. Other terms for MCM packaging include "heterogeneous integration" or "hybrid integrated circuit". The advantage of using MCM packaging is it allows a manufacturer to use multiple components for modularity and/or to improve yields over a conventional monolithic IC approach.

<span class="mw-page-title-main">Larrabee (microarchitecture)</span>

Larrabee is the codename for a cancelled GPGPU chip that Intel was developing separately from its current line of integrated graphics accelerators. It is named after either Mount Larrabee or Larrabee State Park in Whatcom County, Washington, near the town of Bellingham. The chip was to be released in 2010 as the core of a consumer 3D graphics card, but these plans were cancelled due to delays and disappointing early performance figures. The project to produce a GPU retail product directly from the Larrabee research project was terminated in May 2010 and its technology was passed on to the Xeon Phi. The Intel MIC multiprocessor architecture announced in 2010 inherited many design elements from the Larrabee project, but does not function as a graphics processing unit; the product is intended as a co-processor for high performance computing.

<span class="mw-page-title-main">SpursEngine</span>

SpursEngine is a microprocessor from Toshiba built as a media oriented coprocessor, designed for 3D- and video processing in consumer electronics such as set-top boxes and computers. The SpursEngine processor is also known as the Quad Core HD processor. Announced 20 September 2007.

<span class="mw-page-title-main">PERCS</span>

PERCS is IBM's answer to DARPA's High Productivity Computing Systems (HPCS) initiative. The program resulted in commercial development and deployment of the Power 775, a supercomputer design with extremely high performance ratios in fabric and memory bandwidth, as well as very high performance density and power efficiency.

Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing.

<span class="mw-page-title-main">POWER8</span> 2014 family of multi-core microprocessors by IBM

POWER8 is a family of superscalar multi-core microprocessors based on the Power ISA, announced in August 2013 at the Hot Chips conference. The designs are available for licensing under the OpenPOWER Foundation, which is the first time for such availability of IBM's highest-end processors.

IBM Power microprocessors are designed and sold by IBM for servers and supercomputers. The name "POWER" was originally presented as an acronym for "Performance Optimization With Enhanced RISC". The Power line of microprocessors has been used in IBM's RS/6000, AS/400, pSeries, iSeries, System p, System i, and Power Systems lines of servers and supercomputers. They have also been used in data storage devices and workstations by IBM and by other server manufacturers like Bull and Hitachi.

<span class="mw-page-title-main">POWER9</span> 2017 family of multi-core microprocessors by IBM

POWER9 is a family of superscalar, multithreading, multi-core microprocessors produced by IBM, based on the Power ISA. It was announced in August 2016. The POWER9-based processors are being manufactured using a 14 nm FinFET process, in 12- and 24-core versions, for scale out and scale up applications, and possibly other variations, since the POWER9 architecture is open for licensing and modification by the OpenPOWER Foundation members.

The OpenPOWER Foundation is a collaboration around Power ISA-based products initiated by IBM and announced as the "OpenPOWER Consortium" on August 6, 2013. IBM is opening up technology surrounding their Power Architecture offerings, such as processor specifications, firmware and software with a liberal license, and will be using a collaborative development model with their partners.

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

<span class="mw-page-title-main">NVLink</span> High speed chip interconnect

NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).

The Gen-Z Consortium is a trade group of technology vendors involved in designing CPUs, random access memory, servers, storage, and accelerators. The goal was to design an open and royalty-free "memory-semantic" bus protocol, which is not limited by the memory controller of a CPU, to be used in either a switched fabric or a point-to-point device link on a standard connector.

<span class="mw-page-title-main">AMD Instinct</span> Brand name by AMD; professional GPUs for high-performance-computing, machine learning

AMD Instinct is AMD's brand of professional GPUs. It replaced AMD's FirePro S brand in 2016. Compared to the Radeon brand of mainstream consumer/gamer products, the Instinct product line is intended to accelerate deep learning, artificial neural network, and high-performance computing/GPGPU applications.

<span class="mw-page-title-main">Power10</span> 2020 family of multi-core microprocessors by IBM

Power10 is a superscalar, multithreading, multi-core microprocessor family, based on the open source Power ISA, and announced in August 2020 at the Hot Chips conference; systems with Power10 CPUs. Generally available from September 2021 in the IBM Power10 Enterprise E1080 server.

<span class="mw-page-title-main">NEC SX-Aurora TSUBASA</span>

The NEC SX-Aurora TSUBASA is a vector processor of the NEC SX architecture family. Unlike previous SX supercomputers, the SX-Aurora TSUBASA is provided as a PCIe card, termed by NEC as a "Vector Engine" (VE). Eight VE cards can be inserted into a vector host (VH) which is typically a x86-64 server running the Linux operating system. The product has been announced in a press release on 25 October 2017 and NEC has started selling it in February 2018. The product succeeds the SX-ACE.

Compute Express Link (CXL) is an open standard for high-speed, high capacity central processing unit (CPU)-to-device and CPU-to-memory connections, designed for high performance data center computers. CXL is built on the serial PCI Express (PCIe) physical and electrical interface and includes PCIe-based block input/output protocol (CXL.io) and new cache-coherent protocols for accessing system memory (CXL.cache) and device memory (CXL.mem). The serial communication and pooling capabilities allows CXL memory to overcome performance and socket packaging limitations of common DIMM memory when implementing high storage capacities.

Nvidia BlueField is a line of data processing units (DPUs) designed and produced by Nvidia. Initially developed by Mellanox Technologies, the BlueField IP was acquired by Nvidia in March 2019, when Nvidia acquired Mellanox Technologies for US$6.9 billion. The first Nvidia produced BlueField cards, named BlueField-2, were shipped for review shortly after their announcement at VMworld 2019, and were officially launched at GTC 2020. Also launched at GTC 2020 was the Nvidia BlueField-2X, an Nvidia BlueField card with an Ampere generation Graphic Processing Unit integrated onto the same card. BlueField-3 and BlueField-4 DPUs were first announced at GTC 2021, with the tentative launch dates for these cards being 2022 and 2024 respectively.

References

  1. Agam Shah (17 December 2014). "IBM's new Power8 doubles performance of Watson chip". PC World. Archived from the original on 1 February 2018. Retrieved 17 December 2014.
  2. "IBM Power8 Processor Detailed - Features 22nm Design With 12 Cores, 96 MB eDRAM L3 Cache and 4 GHz Clock Speed". WCCFtech. 27 August 2013. Retrieved 17 December 2014.
  3. Md Syadus Sefat, Semih Aslan, Jeffrey W Kellington, Apan Qasem (2019-10-03). "Accelerating HotSpots in Deep Neural Networks on a CAPI-Based FPGA". 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/Smart City/DSS). IEEE. pp. 248–256. doi:10.1109/HPCC/SmartCity/DSS.2019.00048. ISBN   978-1-7281-2058-4. S2CID   203656070.{{cite book}}: CS1 maint: multiple names: authors list (link)
  4. OpenCAPI Unveiled: AMD, IBM, Google, Xilinx, Micron and Mellanox Join Forces in the Heterogenous Computing Era
  5. CXL Consortium and OpenCAPI Consortium Sign Letter of Intent to Transfer OpenCAPI Specifications to CXL
  6. Coherent Accelerator Processor Interface (CAPI) for POWER8 Systems – White Paper
  7. 1 2 Reconfigurable Accelerators for Big Data and Cloud – RAW 2016
  8. 1 2 Opening Up The Server Bus For Coherent Acceleration
  9. Tech Leaders Unite to Enable New Cloud Datacenter Server Designs for Big Data, Machine Learning, Analytics, and other Emerging Workloads
  10. Big Blue Aims For The Sky With Power9
  11. OpenCAPI Takes on PCIe, Vows 10X Improvement
  12. Stuecheli, Jeff (26 January 2017). "Webinar POWER9" (Video recording / slides). AIX Virtual User Group. - Slides (PDF) - AIX VUG page has links to slides and video
  13. Patrick Kennedy (August 5, 2019), Microchip SMC 1000 For The Serial Attached Memory Future, Servethehome