Hardware acceleration

Last updated
A cryptographic accelerator card allows cryptographic operations to be performed at a faster rate. Sun-crypto-accelerator-1000.jpg
A cryptographic accelerator card allows cryptographic operations to be performed at a faster rate.

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

Contents

To perform computing tasks more efficiently, generally one can invest time and money in improving the software, improving the hardware, or both. There are various approaches with advantages and disadvantages in terms of decreased latency, increased throughput and reduced energy consumption. Typical advantages of focusing on software may include greater versatility, more rapid development, lower non-recurring engineering costs, heightened portability, and ease of updating features or patching bugs, at the cost of overhead to compute general operations. Advantages of focusing on hardware may include speedup, reduced power consumption, [1] lower latency, increased parallelism [2] and bandwidth, and better utilization of area and functional components available on an integrated circuit; at the cost of lower ability to update designs once etched onto silicon and higher costs of functional verification, times to market, and need for more parts. In the hierarchy of digital computing systems ranging from general-purpose processors to fully customized hardware, there is a tradeoff between flexibility and efficiency, with efficiency increasing by orders of magnitude when any given application is implemented higher up that hierarchy. [3] This hierarchy includes general-purpose processors such as CPUs, [4] more specialized processors such as programmable shaders in a GPU, [5] fixed-function implemented on field-programmable gate arrays (FPGAs), [6] and fixed-function implemented on application-specific integrated circuits (ASICs). [7]

Hardware acceleration is advantageous for performance, and practical when the functions are fixed so updates are not as needed as in software solutions. With the advent of reprogrammable logic devices such as FPGAs, the restriction of hardware acceleration to fully fixed algorithms has eased since 2010, allowing hardware acceleration to be applied to problem domains requiring modification to algorithms and processing control flow. [8] [9] The disadvantage however, is that in many open source projects, it requires proprietary libraries that not all vendors are keen to distribute or expose, making it difficult to integrate in such projects.

Overview

Integrated circuits can be created to perform arbitrary operations on analog and digital signals. Most often in computing, signals are digital and can be interpreted as binary number data. Computer hardware and software operate on information in binary representation to perform computing; this is accomplished by calculating Boolean functions on the bits of input and outputting the result to some output device downstream for storage or further processing.

Computational equivalence of hardware and software

Because all Turing machines can run any computable function, it is always possible to design custom hardware that performs the same function as a given piece of software. Conversely, software can be always used to emulate the function of a given piece of hardware. Custom hardware may offer higher performance per watt for the same functions that can be specified in software. Hardware description languages (HDLs) such as Verilog and VHDL can model the same semantics as software and synthesize the design into a netlist that can be programmed to an FPGA or composed into the logic gates of an ASIC.

Stored-program computers

The vast majority of software-based computing occurs on machines implementing the von Neumann architecture, collectively known as stored-program computers. Computer programs are stored as data and executed by processors. Such processors must fetch and decode instructions, as well as loading data operands from memory (as part of the instruction cycle) to execute the instructions constituting the software program. Relying on a common cache for code and data leads to the "von Neumann bottleneck", a fundamental limitation on the throughput of software on processors implementing the von Neumann architecture. Even in the modified Harvard architecture, where instructions and data have separate caches in the memory hierarchy, there is overhead to decoding instruction opcodes and multiplexing available execution units on a microprocessor or microcontroller, leading to low circuit utilization. Modern processors that provide simultaneous multithreading exploit under-utilization of available processor functional units and instruction level parallelism between different hardware threads.

Hardware execution units

Hardware execution units do not in general rely on the von Neumann or modified Harvard architectures and do not need to perform the instruction fetch and decode steps of an instruction cycle and incur those stages' overhead. If needed calculations are specified in a register transfer level (RTL) hardware design, the time and circuit area costs that would be incurred by instruction fetch and decoding stages can be reclaimed and put to other uses.

This reclamation saves time, power and circuit area in computation. The reclaimed resources can be used for increased parallel computation, other functions, communication or memory, as well as increased input/output capabilities. This comes at the cost of general-purpose utility.

Emerging hardware architectures

Greater RTL customization of hardware designs allows emerging architectures such as in-memory computing, transport triggered architectures (TTA) and networks-on-chip (NoC) to further benefit from increased locality of data to execution context, thereby reducing computing and communication latency between modules and functional units.

Custom hardware is limited in parallel processing capability only by the area and logic blocks available on the integrated circuit die. [10] Therefore, hardware is much more free to offer massive parallelism than software on general-purpose processors, offering a possibility of implementing the parallel random-access machine (PRAM) model.

It is common to build multicore and manycore processing units out of microprocessor IP core schematics on a single FPGA or ASIC. [11] [12] [13] [14] [15] Similarly, specialized functional units can be composed in parallel as in digital signal processing without being embedded in a processor IP core. Therefore, hardware acceleration is often employed for repetitive, fixed tasks involving little conditional branching, especially on large amounts of data. This is how Nvidia's CUDA line of GPUs are implemented.

Implementation metrics

As device mobility has increased, new metrics have been developed that measure the relative performance of specific acceleration protocols, considering the characteristics such as physical hardware dimensions, power consumption and operations throughput. These can be summarized into three categories: task efficiency, implementation efficiency, and flexibility. Appropriate metrics consider the area of the hardware along with both the corresponding operations throughput and energy consumed. [16]

Applications

Examples of hardware acceleration include bit blit acceleration functionality in graphics processing units (GPUs), use of memristors for accelerating neural networks and regular expression hardware acceleration for spam control in the server industry, intended to prevent regular expression denial of service (ReDoS) attacks. [17] The hardware that performs the acceleration may be part of a general-purpose CPU, or a separate unit called a hardware accelerator, though they are usually referred with a more specific term, such as 3D accelerator, or cryptographic accelerator.

Traditionally, processors were sequential (instructions are executed one by one), and were designed to run general purpose algorithms controlled by instruction fetch (for example moving temporary results to and from a register file). Hardware accelerators improve the execution of a specific algorithm by allowing greater concurrency, having specific datapaths for their temporary variables, and reducing the overhead of instruction control in the fetch-decode-execute cycle.

Modern processors are multi-core and often feature parallel "single-instruction; multiple data" (SIMD) units. Even so, hardware acceleration still yields benefits. Hardware acceleration is suitable for any computation-intensive algorithm which is executed frequently in a task or program. Depending upon the granularity, hardware acceleration can vary from a small functional unit, to a large functional block (like motion estimation in MPEG-2).

Hardware acceleration units by application

ApplicationHardware acceleratorAcronym
Computer graphics Graphics processing unit GPU
  • GPGPU
  • CUDA
  • RTX
  • N/A
Digital signal processing Digital signal processor DSP
Analog signal processing Field-programmable analog array FPAA
  • FPRF
Sound processing Sound card and sound card mixer N/A
Computer networking Network processor and network interface controller NPU and NIC
  • NoC
  • TCPOE or TOE
  • I/OAT or IOAT
Cryptography Cryptographic accelerator and secure cryptoprocessor N/A
Artificial intelligence AI accelerator N/A
  • VPU
  • PNN
  • N/A
Multilinear algebra Tensor processing unit TPU
Physics simulation Physics processing unit PPU
Regular expressions [17] Regular expression coprocessorN/A
Data compression [18] Data compression acceleratorN/A
In-memory processingNetwork on a chip and Systolic array NoC; N/A
Data processing Data processing unit DPU
Any computing taskComputer hardware HW (sometimes)
  • FPGA
  • ASIC
  • CPLD
  • SoC
    • MPSoC
    • PSoC

See also

Related Research Articles

<span class="mw-page-title-main">Central processing unit</span> Central computer component which executes instructions

A central processing unit (CPU)—also called a central processor or main processor—is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

<span class="mw-page-title-main">Field-programmable gate array</span> Array of logic gates that are reprogrammable

A field-programmable gate array (FPGA) is a type of integrated circuit that can be programmed or reprogrammed after manufacturing. It consists of an array of programmable logic blocks and interconnects that can be configured to perform various digital functions. FPGAs are commonly used in applications where flexibility, speed, and parallel processing capabilities are required, such as in telecommunications, automotive, aerospace, and industrial sectors.

<span class="mw-page-title-main">System on a chip</span> Micro-electronic component

A system on a chip or system-on-chip is an integrated circuit that integrates most or all components of a computer or other electronic system. These components almost always include on-chip central processing unit (CPU), memory interfaces, input/output devices and interfaces, and secondary storage interfaces, often alongside other components such as radio modems and a graphics processing unit (GPU) – all on a single substrate or microchip. SoCs may contain digital and also analog, mixed-signal and often radio frequency signal processing functions.

<span class="mw-page-title-main">Parallel computing</span> Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

<span class="mw-page-title-main">Application-specific integrated circuit</span> Integrated circuit customized for a specific task

An application-specific integrated circuit is an integrated circuit (IC) chip customized for a particular use, rather than intended for general-purpose use, such as a chip designed to run in a digital voice recorder or a high-efficiency video codec. Application-specific standard product chips are intermediate between ASICs and industry standard integrated circuits like the 7400 series or the 4000 series. ASIC chips are typically fabricated using metal–oxide–semiconductor (MOS) technology, as MOS integrated circuit chips.

Reconfigurable computing is a computer architecture combining some of the flexibility of software with the high performance of hardware by processing with flexible hardware platforms like field-programmable gate arrays (FPGAs). The principal difference when compared to using ordinary microprocessors is the ability to add custom computational blocks using FPGAs. On the other hand, the main difference from custom hardware, i.e. application-specific integrated circuits (ASICs) is the possibility to adapt the hardware during runtime by "loading" a new circuit on the reconfigurable fabric, thus providing new computational blocks without the need to manufacture and add new chips to the existing system.

<span class="mw-page-title-main">Coprocessor</span> Type of computer processor

A coprocessor is a computer processor used to supplement the functions of the primary processor. Operations performed by the coprocessor may be floating-point arithmetic, graphics, signal processing, string processing, cryptography or I/O interfacing with peripheral devices. By offloading processor-intensive tasks from the main processor, coprocessors can accelerate system performance. Coprocessors allow a line of computers to be customized, so that customers who do not need the extra performance do not need to pay for it.

<span class="mw-page-title-main">Graphics processing unit</span> Specialized electronic circuit; graphics accelerator

A graphics processing unit (GPU) is a specialized electronic circuit initially designed to accelerate computer graphics and image processing. After their initial design, GPUs were found to be useful for non-graphic calculations involving embarrassingly parallel problems due to their parallel structure. Other non-graphical uses include the training of neural networks and cryptocurrency mining.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

A physics processing unit (PPU) is a dedicated microprocessor designed to handle the calculations of physics, especially in the physics engine of video games. It is an example of hardware acceleration.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

<span class="mw-page-title-main">Ray-tracing hardware</span> Type of 3D graphics accelerator

Ray-tracing hardware is special-purpose computer hardware designed for accelerating ray tracing calculations.

This is a glossary of terms used in the field of Reconfigurable computing and reconfigurable computing systems, as opposed to the traditional Von Neumann architecture.

<span class="mw-page-title-main">Computer architecture</span> Set of rules describing computer system

In computer science and computer engineering, computer architecture is a description of the structure of a computer system made from component parts. It can sometimes be a high-level description that ignores details of the implementation. At a more detailed level, the description may include the instruction set architecture design, microarchitecture design, logic design, and implementation.

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

Computation offloading is the transfer of resource intensive computational tasks to a separate processor, such as a hardware accelerator, or an external platform, such as a cluster, grid, or a cloud. Offloading to a coprocessor can be used to accelerate applications including: image rendering and mathematical calculations. Offloading computing to an external platform over a network can provide computing power and overcome hardware limitations of a device, such as limited computational power, storage, and energy.

A vision processing unit (VPU) is an emerging class of microprocessor; it is a specific type of AI accelerator, designed to accelerate machine vision tasks.

An AI accelerator or neural processing unit is a class of specialized hardware accelerator or computer system designed to accelerate artificial intelligence and machine learning applications, including artificial neural networks and machine vision. Typical applications include algorithms for robotics, Internet of Things, and other data-intensive or sensor-driven tasks. They are often manycore designs and generally focus on low-precision arithmetic, novel dataflow architectures or in-memory computing capability. As of 2018, a typical AI integrated circuit chip contains billions of MOSFET transistors. A number of vendor-specific terms exist for devices in this category, and it is an emerging technology without a dominant design.

Coherent Accelerator Processor Interface (CAPI), is a high-speed processor expansion bus standard for use in large data center computers, initially designed to be layered on top of PCI Express, for directly connecting central processing units (CPUs) to external accelerators like graphics processing units (GPUs), ASICs, FPGAs or fast storage. It offers low latency, high speed, direct memory access connectivity between devices of different instruction set architectures.

A domain-specific architecture (DSA) is a programmable computer architecture specifically tailored to operate very efficiently within the confines of a given application domain. The term is often used in contrast to general-purpose architectures, such as CPUs, that are designed to operate on any computer program.

References

  1. "Microsoft Supercharges Bing Search With Programmable Chips". WIRED. 16 June 2014.
  2. "Embedded". Archived from the original on 2007-10-08. Retrieved 2012-08-18. "FPGA Architectures from 'A' to 'Z'" by Clive Maxfield 2006
  3. Sinan, Kufeoglu; Mahmut, Ozkuran (2019). "Figure 5. CPU, GPU, FPGA, and ASIC minimum energy consumption between difficulty recalculation.". Energy Consumption of Bitcoin Mining. doi: 10.17863/CAM.41230 .
  4. Kim, Yeongmin; Kong, Joonho; Munir, Arslan (2020). "CPU-Accelerator Co-Scheduling for CNN Acceleration at the Edge". IEEE Access. 8: 211422–211433. doi: 10.1109/ACCESS.2020.3039278 . ISSN   2169-3536.
  5. Lin, Yibo; Jiang, Zixuan; Gu, Jiaqi; Li, Wuxi; Dhar, Shounak; Ren, Haoxing; Khailany, Brucek; Pan, David Z. (April 2021). "DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 40 (4): 748–761. doi:10.1109/TCAD.2020.3003843. ISSN   1937-4151. S2CID   225744481.
  6. Lyakhov, Pavel; Valueva, Maria; Valuev, Georgii; Nagornov, Nikolai (2020-12-18). "A Method of Increasing Digital Filter Performance Based on Truncated Multiply-Accumulate Units". Applied Sciences. 10 (24): 9052. doi: 10.3390/app10249052 . ISSN   2076-3417. Hardware simulation on FPGA increased the digital filter performance.
  7. Mohan, Prashanth; Wang, Wen; Jungk, Bernhard; Niederhagen, Ruben; Szefer, Jakub; Mai, Ken (October 2020). "ASIC Accelerator in 28 nm for the Post-Quantum Digital Signature Scheme XMSS". 2020 IEEE 38th International Conference on Computer Design (ICCD). Hartford, CT, USA: IEEE. pp. 656–662. doi:10.1109/ICCD50377.2020.00112. ISBN   978-1-7281-9710-4. S2CID   229330964.
  8. Morgan, Timothy Pricket (2014-09-03). "How Microsoft Is Using FPGAs To Speed Up Bing Search". Enterprise Tech. Retrieved 2018-09-18.
  9. "Project Catapult". Microsoft Research.
  10. MicroBlaze Soft Processor: Frequently Asked Questions Archived 2011-10-27 at the Wayback Machine
  11. Vassányi, István (1998). "Implementing processor arrays on FPGAs". Field-Programmable Logic and Applications from FPGAs to Computing Paradigm. Lecture Notes in Computer Science. Vol. 1482. pp. 446–450. doi:10.1007/BFb0055278. ISBN   978-3-540-64948-9.
  12. Zhoukun WANG and Omar HAMMAMI. "A 24 Processors System on Chip FPGA Design with Network on Chip".
  13. John Kent. "Micro16 Array - A Simple CPU Array"
  14. Kit Eaton. "1,000 Core CPU Achieved: Your Future Desktop Will Be a Supercomputer". 2011.
  15. "Scientists Squeeze Over 1,000 Cores onto One Chip". 2011. Archived 2012-03-05 at the Wayback Machine
  16. Kienle, Frank; Wehn, Norbert; Meyr, Heinrich (December 2011). "On Complexity, Energy- and Implementation-Efficiency of Channel Decoders". IEEE Transactions on Communications. 59 (12): 3301–3310. arXiv: 1003.3792 . doi:10.1109/tcomm.2011.092011.100157. ISSN   0090-6778. S2CID   13863870.
  17. 1 2 "Regular Expressions in hardware" . Retrieved 17 July 2014.
  18. "Compression Accelerators - Microsoft Research". Microsoft Research. Retrieved 2017-10-07.
  19. 1 2 Farabet, Clément, et al. "Hardware accelerated convolutional neural networks for synthetic vision systems [ dead link ]." Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, 2010.