# Hardware acceleration

In computing, hardware acceleration is the use of computer hardware specially made to perform some functions more efficiently than is possible in software running on a general-purpose CPU. Any transformation of data or routine that can be computed can be calculated purely in software running on a generic CPU, purely in custom-made hardware, or in some mix of both. An operation can be computed faster in application-specific hardware designed or programmed to compute it than when it is specified in software and performed on a general-purpose computer processor. Each approach has advantages and disadvantages. The implementation of computing tasks in hardware to decrease latency and increase throughput is known as hardware acceleration.

Computer hardware includes the physical, tangible parts or components of a computer, such as the cabinet, central processing unit, monitor, keyboard, computer data storage, graphics card, sound card, speakers and motherboard. By contrast, software is instructions that can be stored and run by hardware. Hardware is so-termed because it is "hard" or rigid with respect to changes or modifications; whereas software is "soft" because it is easy to update or change. Intermediate between software and hardware is "firmware", which is software that is strongly coupled to the particular hardware of a computer system and thus the most difficult to change but also among the most stable with respect to consistency of interface. The progression from levels of "hardness" to "softness" in computer systems parallels a progression of layers of abstraction in computing.

Computer software, or simply software, is a collection of data or computer instructions that tell the computer how to work. This is in contrast to physical hardware, from which the system is built and actually performs the work. In computer science and software engineering, computer software is all information processed by computer systems, programs and data. Computer software includes computer programs, libraries and related non-executable data, such as online documentation or digital media. Computer hardware and software require each other and neither can be realistically used on its own.

Typical advantages of software include more rapid development (leading to faster times to market), lower non-recurring engineering costs, heightened portability, and ease of updating features or patching bugs, at the cost of overhead to compute general operations. Advantages of hardware include speedup, reduced power consumption, [1] lower latency, increased parallelism [2] and bandwidth, and better utilization of area and functional components available on an integrated circuit, at the cost of a lower ability to update designs once etched onto silicon, higher costs of functional verification, and longer times to market. In the hierarchy of digital computing systems ranging from general-purpose processors to fully customized hardware, there is a tradeoff between flexibility and efficiency, with efficiency increasing by orders of magnitude when any given application is implemented higher up that hierarchy. [3] [4] This hierarchy includes general-purpose processors such as CPUs, more specialized processors such as GPUs, fixed-function logic implemented on field-programmable gate arrays (FPGAs), and fixed-function logic implemented on application-specific integrated circuits (ASICs).

In software engineering, a software development process is the process of dividing software development work into distinct phases to improve design, product management, and project management. It is also known as a software development life cycle (SDLC). The methodology may include the pre-definition of specific deliverables and artifacts that are created and completed by a project team to develop or maintain an application.

In commerce, time to market (TTM) is the length of time it takes from a product being conceived until its being available for sale. TTM is important in industries where products are outmoded quickly. A common assumption is that TTM matters most for first-of-a-kind products, but actually the leader often has the luxury of time, while the clock is clearly running for the followers.

Non-recurring engineering (NRE) refers to the one-time cost to research, design, develop and test a new product or product enhancement. When budgeting for a new product, NRE must be considered to analyze if a new product will be profitable. Even though a company will pay for NRE on a project only once, NRE costs can be prohibitively high and the product will need to sell well enough to produce a return on the initial investment. NRE is unlike production costs, which must be paid constantly to maintain production of a product. It is a form of fixed cost in economics terms. Once a system is designed any number of units can be manufactured without increasing NRE cost. NRE can be also formulated and paid via another commercial term called Royalty Fee. The Royalty Fee could be a percentage of sales revenue or profit or combination of these two, which have to be incorporated in a mid to long term agreement between technology supplier and the OEM.

Hardware acceleration is advantageous for performance, and practical when the functions are fixed, so that updates are needed less often than in software solutions. With the advent of reprogrammable logic devices such as FPGAs, the restriction of hardware acceleration to fully fixed algorithms has eased since 2010, allowing hardware acceleration to be applied to problem domains requiring modification to algorithms and processing control flow. [5] [6] [7]

In computing, computer performance is the amount of useful work accomplished by a computer system. Outside of specific contexts, computer performance is estimated in terms of accuracy, efficiency and speed of executing computer program instructions. When it comes to high computer performance, one or more of the following factors might be involved:

Fixed-function is a term canonically used to contrast 3D graphics APIs and earlier GPUs designed prior to the advent of shader-based 3D graphics APIs and GPU architectures.

Reconfigurable computing is a computer architecture combining some of the flexibility of software with the high performance of hardware by processing with very flexible high speed computing fabrics like field-programmable gate arrays (FPGAs). The principal difference when compared to using ordinary microprocessors is the ability to make substantial changes to the datapath itself in addition to the control flow. On the other hand, the main difference from custom hardware, i.e. application-specific integrated circuits (ASICs) is the possibility to adapt the hardware during runtime by "loading" a new circuit on the reconfigurable fabric.

## Overview

Integrated circuits can be created to perform arbitrary operations on analog and digital signals. Most often in computing, signals are digital and can be interpreted as binary number data. Computer hardware and software operate on information in binary representation to perform computing; this is accomplished by calculating boolean functions on the bits of input and outputting the result to some output device downstream for storage or further processing.
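As a rough illustration of computing boolean functions on bits, the following C++ sketch models a one-bit full adder as two boolean expressions and chains eight of them into a ripple-carry adder, much as a gate-level design would. The names `full_adder` and `ripple_add8` are hypothetical and chosen only for this example; the sketch is not taken from any particular hardware design.

```cpp
#include <cstdint>

// Result of a one-bit full adder: the sum bit and the outgoing carry.
struct FullAdderResult {
    bool sum;
    bool carry_out;
};

// One-bit full adder written purely as boolean functions of its inputs.
FullAdderResult full_adder(bool a, bool b, bool carry_in) {
    const bool sum = a ^ b ^ carry_in;
    const bool carry_out = (a & b) | (carry_in & (a ^ b));
    return {sum, carry_out};
}

// 8-bit ripple-carry adder: chains eight full adders, bit 0 first.
// The final carry is discarded, so the addition wraps modulo 256.
std::uint8_t ripple_add8(std::uint8_t x, std::uint8_t y) {
    std::uint8_t result = 0;
    bool carry = false;
    for (int bit = 0; bit < 8; ++bit) {
        const bool a = ((x >> bit) & 1u) != 0;
        const bool b = ((y >> bit) & 1u) != 0;
        const FullAdderResult r = full_adder(a, b, carry);
        result = static_cast<std::uint8_t>(
            result | (static_cast<unsigned>(r.sum) << bit));
        carry = r.carry_out;
    }
    return result;
}
```

In software the eight adder stages run one after another inside the loop; in a circuit all eight exist side by side as physical gates, and only the carry signal ripples between them.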

An analog signal is any continuous signal for which the time-varying feature (variable) of the signal is a representation of some other time varying quantity, i.e., analogous to another time varying signal. For example, in an analog audio signal, the instantaneous voltage of the signal varies continuously with the pressure of the sound waves. It differs from a digital signal, in which the continuous quantity is a representation of a sequence of discrete values which can only take on one of a finite number of values. The term analog signal usually refers to electrical signals; however, mechanical, pneumatic, hydraulic, human speech, and other systems may also convey or be considered analog signals.

A digital signal is a signal that is being used to represent data as a sequence of discrete values; at any given time it can only take on one of a finite number of values. This contrasts with an analog signal, which represents continuous values; at any given time it represents a real number within a continuous range of values.

In mathematics and digital electronics, a binary number is a number expressed in the base-2 numeral system or binary numeral system, which uses only two symbols: typically "0" (zero) and "1" (one).

### Computational equivalence of hardware and software

Either software or hardware can compute any computable function. Custom hardware offers higher performance per watt for the same functions that can be specified in software. Hardware description languages (HDLs) such as Verilog and VHDL can model the same semantics as software and synthesize the design into a netlist that can be programmed to an FPGA or composed into logic gates of an application-specific integrated circuit.

Computable functions are the basic objects of study in computability theory. Computable functions are the formalized analogue of the intuitive notion of algorithm, in the sense that a function is computable if there exists an algorithm that can do the job of the function, i.e. given an input of the function domain it can return the corresponding output. Computable functions are used to discuss computability without referring to any concrete model of computation such as Turing machines or register machines. Any definition, however, must make reference to some specific model of computation but all valid definitions yield the same class of functions. Particular models of computability that give rise to the set of computable functions are the Turing-computable functions and the μ-recursive functions.

In computing, performance per watt is a measure of the energy efficiency of a particular computer architecture or computer hardware. Literally, it measures the rate of computation that can be delivered by a computer for every watt of power consumed. This rate is typically measured by performance on the LINPACK benchmark when trying to compare between computing systems.

In computer engineering, a hardware description language (HDL) is a specialized computer language used to describe the structure and behavior of electronic circuits, and most commonly, digital logic circuits.

### Stored-program computers

The vast majority of software-based computing occurs on machines implementing the von Neumann architecture, collectively known as stored-program computers. Computer programs are stored as data and executed by processors, typically one or more CPU cores. Such processors must fetch and decode instructions as well as data operands from memory as part of the instruction cycle to execute the instructions constituting the software program. Relying on a common cache for code and data leads to the von Neumann bottleneck, a fundamental limitation on the throughput of software on processors implementing the von Neumann architecture. Even in the modified Harvard architecture, where instructions and data have separate caches in the memory hierarchy, there is overhead to decoding instruction opcodes and multiplexing available execution units on a microprocessor or microcontroller, leading to low circuit utilization. Intel's hyper-threading technology provides simultaneous multithreading by exploiting under-utilization of available processor functional units and instruction level parallelism between different hardware threads.
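The fetch and decode overhead described above can be sketched with a toy stored-program machine. The three-instruction accumulator machine below is hypothetical (it is not a real instruction set), and names such as `Opcode`, `Instruction` and `run` are assumptions made for illustration. The point is that every useful operation is preceded by a fetch and a decode step.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical three-instruction accumulator machine (not a real ISA).
enum class Opcode : std::uint8_t { LoadImm, Add, Halt };

struct Instruction {
    Opcode op;
    std::int32_t operand;
};

struct RunResult {
    std::int32_t accumulator;
    std::size_t instructions_fetched;
};

// Executes a program stored as data, counting how many instructions were fetched.
RunResult run(const std::vector<Instruction>& program) {
    std::int32_t acc = 0;
    std::size_t pc = 0;       // program counter
    std::size_t fetched = 0;
    while (pc < program.size()) {
        const Instruction inst = program[pc++];  // fetch
        ++fetched;
        switch (inst.op) {                        // decode
            case Opcode::LoadImm: acc = inst.operand; break;   // execute
            case Opcode::Add:     acc += inst.operand; break;  // execute
            case Opcode::Halt:    return {acc, fetched};
        }
    }
    return {acc, fetched};
}
```

Computing 2 + 3 + 4 on this machine takes four fetched instructions (one load, two adds, one halt), whereas a dedicated adder circuit would need no instruction stream at all.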

The von Neumann architecture—also known as the von Neumann model or Princeton architecture—is a computer architecture based on a 1945 description by the mathematician and physicist John von Neumann and others in the First Draft of a Report on the EDVAC. That document describes a design architecture for an electronic digital computer with these components:

A stored-program computer is a computer that stores program instructions in electronic memory. This contrasts with machines where the program instructions are stored on plugboards or similar mechanisms.

A computer program is a collection of instructions that performs a specific task when executed by a computer. Most computer devices require programs to function properly.

### Hardware execution units

Hardware execution units do not in general rely on the von Neumann or modified Harvard architectures and do not need to perform the instruction fetch and decode steps of an instruction cycle and incur those stages' overhead. If needed calculations are specified in a register transfer level (RTL) hardware design, the time and circuit area costs that would be incurred by instruction fetch and decoding stages can be reclaimed and put to other uses.

This reclamation saves time, power and circuit area in computation. The reclaimed resources can be used for increased parallel computation, other functions, communication or memory, as well as increased input/output capabilities. This comes at the opportunity cost of less general-purpose utility.

### Emerging hardware architectures

Greater RTL customization of hardware designs allows emerging architectures such as in-memory computing, transport triggered architectures (TTA) and networks-on-chip (NoC) to further benefit from increased locality of data to execution context, thereby reducing computing and communication latency between modules and functional units.

Custom hardware is limited in parallel processing capability only by the area and logic blocks available on the integrated circuit die. [8] Therefore, hardware is much more free to offer massive parallelism than software on general-purpose processors, offering a possibility of implementing the parallel random-access machine (PRAM) model.

It is common to build multicore and manycore processing units out of microprocessor IP core schematics on a single FPGA or ASIC. [9] [10] [11] [12] [13] Similarly, specialized functional units can be composed in parallel, as in digital signal processing, without being embedded in a processor IP core. Therefore, hardware acceleration is often employed for repetitive, fixed tasks involving little conditional branching, especially on large amounts of data. This is how Nvidia's CUDA line of GPUs is implemented.

### Implementation Metrics

As device mobility has increased, comparing the relative performance of specific acceleration protocols has required new metrics that take into account characteristics such as physical hardware dimensions, power consumption and operations throughput. These can be summarized into three categories: task efficiency, implementation efficiency, and flexibility. Appropriate metrics consider the area of the hardware along with both the corresponding operations throughput and energy consumed. [14]
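One way to relate area, throughput and energy is through simple ratios. The sketch below is only an illustration: the struct `AcceleratorProfile`, its field names and units, and the two ratios are assumptions made for this example, not the specific metrics defined in the cited work.

```cpp
// Hypothetical description of one accelerator implementation; the field
// names and units are assumptions chosen for illustration.
struct AcceleratorProfile {
    double area_mm2;        // silicon area occupied
    double ops_per_second;  // sustained operations throughput
    double power_watts;     // average power drawn while running
};

// Throughput delivered per unit of area (operations per second per mm^2).
double area_efficiency(const AcceleratorProfile& p) {
    return p.ops_per_second / p.area_mm2;
}

// Energy spent per operation in joules: watts divided by operations per second.
double energy_per_operation(const AcceleratorProfile& p) {
    return p.power_watts / p.ops_per_second;
}
```

With these ratios, two implementations of the same task can be compared even when they differ in size and power budget: a higher area efficiency and a lower energy per operation both favor the more specialized design.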

### Summing one million integers

Suppose we wish to compute the sum of ${\displaystyle 2^{20}=1,048,576}$ integers. Assuming an arbitrary-precision integer type, bignum, large enough to hold the sum is available, this can be done in software as follows (here, in C++):

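    #include <array>
    #include <cstddef>

    // bignum is assumed to be an arbitrary-precision integer type defined elsewhere.
    constexpr int N = 20;
    constexpr int two_to_the_N = 1 << N;

    // Sums the elements one after another: two_to_the_N sequential additions.
    bignum array_sum(const std::array<int, two_to_the_N>& ints) {
        bignum result = 0;
        for (std::size_t i = 0; i < two_to_the_N; i++) {
            result += ints[i];
        }
        return result;
    }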

This algorithm runs in linear time, ${\textstyle {\mathcal {O}}\left(n\right)}$ in Big O notation. In hardware, with sufficient area on chip, calculation can be parallelized to take only 20 time steps using the prefix sum algorithm. [15] The algorithm requires only logarithmic time, ${\textstyle {\mathcal {O}}\left(\log {n}\right)}$, and ${\textstyle {\mathcal {O}}\left(1\right)}$ space as an in-place algorithm:

    parameter int N = 20;
    parameter int two_to_the_N = 1 << N;

    function int array_sum;
        input int array[two_to_the_N];
        begin
            for (genvar i = 0; i < N; i++) begin
                for (genvar j = 0; j < two_to_the_N; j++) begin
                    if (j >= (1 << i)) begin
                        array[j] = array[j] + array[j - (1 << i)];
                    end
                end
            end
            return array[two_to_the_N - 1];
        end
    endfunction

This example takes advantage of the greater parallel resources available in application-specific hardware than most software and general-purpose computing paradigms and architectures.
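To make the step count concrete, the following C++ sketch simulates the same prefix-sum scheme in software and counts the parallel time steps. It is a minimal model under stated assumptions: the names `ScanResult` and `scan_sum` are hypothetical, and 64-bit integers stand in for the bignum type. Each outer iteration corresponds to one time step in hardware, where all positions would be updated at once.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ScanResult {
    std::int64_t total;  // sum of all inputs (last element of the prefix sums)
    std::size_t steps;   // number of parallel time steps simulated
};

// Simulates the in-place prefix sum. The vector is taken by value so the
// caller's data is left untouched.
ScanResult scan_sum(std::vector<std::int64_t> values) {
    std::size_t steps = 0;
    for (std::size_t stride = 1; stride < values.size(); stride *= 2) {
        // In hardware every position j updates in the same time step.
        // Software walks j downward so that values[j - stride] still holds
        // the previous step's value when it is read.
        for (std::size_t j = values.size() - 1; j >= stride; --j) {
            values[j] += values[j - stride];
        }
        ++steps;
    }
    return {values.empty() ? 0 : values.back(), steps};
}
```

For 16 inputs the simulation finishes in 4 steps, and for $2^{20}$ inputs it would take 20, matching the logarithmic bound. The software version still performs every addition sequentially; only dedicated hardware can collapse each inner loop into a single time step.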

### Stream processing

Hardware acceleration can be applied to stream processing.

## Applications

Examples of hardware acceleration include bit blit acceleration functionality in graphics processing units (GPUs), use of memristors for accelerating neural networks [16] and regular expression hardware acceleration for spam control in the server industry, intended to prevent regular expression denial of service (ReDoS) attacks. [17] The hardware that performs the acceleration may be part of a general-purpose CPU, or a separate unit. In the second case, it is referred to as a hardware accelerator, or often more specifically as a 3D accelerator, cryptographic accelerator, etc.

Traditionally, processors were sequential (instructions are executed one by one), and were designed to run general purpose algorithms controlled by instruction fetch (for example moving temporary results to and from a register file). Hardware accelerators improve the execution of a specific algorithm by allowing greater concurrency, having specific datapaths for their temporary variables, and reducing the overhead of instruction control in the fetch-decode-execute cycle.

Modern processors are multi-core and often feature parallel "single-instruction; multiple data" (SIMD) units. Even so, hardware acceleration still yields benefits. Hardware acceleration is suitable for any computation-intensive algorithm which is executed frequently in a task or program. Depending upon the granularity, hardware acceleration can vary from a small functional unit, to a large functional block (like motion estimation in MPEG-2).

## Hardware acceleration units by application

| Application | Hardware accelerator | Acronym |
|---|---|---|
| Computer graphics | Graphics processing unit | GPU (also GPGPU, CUDA, RTX) |
| Digital signal processing | Digital signal processor | DSP |
| Analog signal processing | Field-programmable analog array | FPAA (also FPRF) |
| Sound processing | Sound card and sound card mixer | N/A |
| Computer networking | Network processor and network interface controller | NPU and NIC (also NoC, TCPOE or TOE, I/OAT or IOAT) |
| Cryptography | Cryptographic accelerator and secure cryptoprocessor | N/A |
| Artificial intelligence | AI accelerator | N/A (also VPU, PNN) |
| Multilinear algebra | Tensor processing unit | TPU |
| Physics simulation | Physics processing unit | PPU |
| Regular expressions [17] | Regular expression coprocessor | N/A |
| Data compression [18] | Data compression accelerator | N/A |
| In-memory processing | Network on a chip and systolic array | NoC; N/A |
| Any computing task | Computer hardware | HW (sometimes) (also FPGA, ASIC, CPLD, SoC, MPSoC, PSoC) |

## Related Research Articles

A central processing unit (CPU), also called a central processor or main processor, is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions. The computer industry has used the term "central processing unit" at least since the early 1960s. Traditionally, the term "CPU" refers to a processor, more specifically to its processing unit and control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.

Processor design is the design engineering task of creating a processor, a key component of computer hardware. It is a subfield of computer engineering and electronics engineering (fabrication). The design process involves choosing an instruction set and a certain execution paradigm and results in a microarchitecture, which might be described in e.g. VHDL or Verilog. For microprocessor design, this description is then manufactured employing some of the various semiconductor device fabrication processes, resulting in a die which is bonded onto a chip carrier. This chip carrier is then soldered onto, or inserted into a socket on, a printed circuit board (PCB).

Microcode is a computer hardware technique that interposes a layer of organisation between the CPU hardware and the programmer-visible instruction set architecture of the computer. As such, the microcode is a layer of hardware-level instructions that implement higher-level machine code instructions or internal state machine sequencing in many digital processing elements. Microcode is used in general-purpose central processing units, although in current desktop CPUs it is only a fallback path for cases that the faster hardwired control unit cannot handle.

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, compared to the scalar processors, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the later 1990s.

A system on a chip or system on chip is an integrated circuit that integrates all components of a computer or other electronic system. These components typically include a central processing unit (CPU), memory, input/output ports and secondary storage – all on a single substrate or microchip, the size of a coin. It may contain digital, analog, mixed-signal, and often radio frequency signal processing functions, depending on the application. As they are integrated on a single substrate, SoCs consume much less power and take up much less area than multi-chip designs with equivalent functionality. Because of this, SoCs are very common in the mobile computing and edge computing markets. Systems on chip are commonly used in embedded systems and the Internet of Things.

Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but it's gaining broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

A digital signal processor (DSP) is a specialized microprocessor, with its architecture optimized for the operational needs of digital signal processing.

A coprocessor is a computer processor used to supplement the functions of the primary processor. Operations performed by the coprocessor may be floating point arithmetic, graphics, signal processing, string processing, cryptography or I/O interfacing with peripheral devices. By offloading processor-intensive tasks from the main processor, coprocessors can accelerate system performance. Coprocessors allow a line of computers to be customized, so that customers who do not need the extra performance do not need to pay for it.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. In addition, even a single GPU-CPU framework provides advantages that multiple CPUs on their own do not offer due to the specialization in each chip.

Stream processing is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing. Such applications can use multiple computational units, such as the floating point unit on a graphics processing unit or field-programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.

In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data into a triggering port of a functional unit triggers the functional unit to start a computation. This is similar to what happens in a systolic array. Due to its modular structure, TTA is an ideal processor template for application-specific instruction-set processors (ASIP) with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators.

In integrated circuit design, hardware emulation is the process of imitating the behavior of one or more pieces of hardware with another piece of hardware, typically a special purpose emulation system. The emulation model is usually based on a hardware description language source code, which is compiled into the format used by emulation system. The goal is normally debugging and functional verification of the system being designed. Often an emulator is fast enough to be plugged into a working target system in place of a yet-to-be-built chip, so the whole system can be debugged with live data. This is a specific case of in-circuit emulation.

VideoCore is a low-power mobile multimedia processor originally developed by Alphamosaic Ltd and now owned by Broadcom. Its two-dimensional DSP architecture makes it flexible and efficient enough to decode a number of multimedia codecs in software while maintaining low power usage. The semiconductor intellectual property core has been found so far only on Broadcom SoCs.

In computer engineering, computer architecture is a set of rules and methods that describe the functionality, organization, and implementation of computer systems. Some definitions of architecture define it as describing the capabilities and programming model of a computer but not a particular implementation. In other definitions computer architecture involves instruction set architecture design, microarchitecture design, logic design, and implementation.

Heterogeneous computing refers to systems that use more than one kind of processor or cores. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

In computing, a compute kernel is a routine compiled for high throughput accelerators, separate from but used by a main program. They are sometimes called compute shaders, sharing execution units with vertex shaders and pixel shaders on GPUs, but are not limited to execution on one class of device, or graphics APIs.

A vision processing unit (VPU) is an emerging class of microprocessor; it is a specific type of AI accelerator, designed to accelerate machine vision tasks.

An AI accelerator is a class of microprocessor or computer system designed as hardware acceleration for artificial intelligence applications, especially artificial neural networks, machine vision and machine learning. Typical applications include algorithms for robotics, internet of things and other data-intensive or sensor-driven tasks. They are often manycore designs and generally focus on low-precision arithmetic, novel dataflow architectures or in-memory computing capability. A number of vendor-specific terms exist for devices in this category, and it is an emerging technology without a dominant design. AI accelerators can be found in many devices such as smartphones, tablets, and computers all around the world.

## References

1. "Microsoft Supercharges Bing Search With Programmable Chips". WIRED. 16 June 2014.
2. "Embedded". Archived from the original on 2007-10-08. Retrieved 2012-08-18. "FPGA Architectures from 'A' to 'Z'" by Clive Maxfield 2006
3. "Mining hardware comparison - Bitcoin". Retrieved 17 July 2014.
4. "Non-specialized hardware comparison - Bitcoin". Retrieved 25 February 2014.
5. "A Survey of FPGA-based Accelerators for Convolutional Neural Networks", S. Mittal, NCAA, 2018
6. Morgan, Timothy Pricket (2014-09-03). "How Microsoft Is Using FPGAs To Speed Up Bing Search". Enterprise Tech. Retrieved 2018-09-18.
7. "Project Catapult". Microsoft Research.
8. István Vassányi. "Implementing processor arrays on FPGAs". 1998.
9. Zhoukun WANG and Omar HAMMAMI. "A 24 Processors System on Chip FPGA Design with Network on Chip".
10. John Kent. "Micro16 Array - A Simple CPU Array"
11. Kit Eaton. "1,000 Core CPU Achieved: Your Future Desktop Will Be a Supercomputer". 2011.
12. "Scientists Squeeze Over 1,000 Cores onto One Chip". 2011. Archived 2012-03-05 at the Wayback Machine
13. Kienle, Frank; Wehn, Norbert; Meyr, Heinrich (December 2011). "On Complexity, Energy- and Implementation-Efficiency of Channel Decoders". IEEE Transactions on Communications. 59 (12): 3301–3310. doi:10.1109/tcomm.2011.092011.100157. ISSN 0090-6778.
14. Hillis, W. Daniel; Steele, Jr., Guy L. (December 1986). "Data parallel algorithms". Communications of the ACM. 29 (12): 1170–1183. doi:10.1145/7902.7903.
15. "A Survey of ReRAM-based Architectures for Processing-in-memory and Neural Networks", S. Mittal, Machine Learning and Knowledge Extraction, 2018
16. "Regular Expressions in hardware". Retrieved 17 July 2014.
17. "Compression Accelerators - Microsoft Research". Microsoft Research. Retrieved 2017-10-07.
18. Farabet, Clément, et al. "Hardware accelerated convolutional neural networks for synthetic vision systems." Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, 2010.