General-purpose computing on graphics processing units

General-purpose computing on graphics processing units (GPGPU, or less often GPGP) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). [1] [2] [3] [4] The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. [5]

Essentially, a GPGPU pipeline is a kind of parallel processing between one or more GPUs and CPUs that analyzes data as if it were in image or other graphic form. While GPUs operate at lower frequencies, they typically have many times the number of cores. Thus, GPUs can process far more pictures and graphical data per second than a traditional CPU. Migrating data into graphical form and then using the GPU to scan and analyze it can create a large speedup.

GPGPU pipelines were developed at the beginning of the 21st century for graphics processing (e.g. for better shaders). These pipelines were found to fit scientific computing needs well, and have since been developed in this direction.

The best-known GPGPU platforms are Nvidia's Tesla line of accelerators, used in Nvidia DGX systems, alongside AMD Instinct and Intel Gaudi.

History

In principle, any arbitrary Boolean function, including addition, multiplication, and other mathematical functions, can be built up from a functionally complete set of logic operators. In 1987, Conway's Game of Life became one of the first examples of general-purpose computing using an early stream processor called a blitter to invoke a special sequence of logical operations on bit vectors. [6]

General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable shaders and floating point support on graphics processors. Notably, problems involving matrices and/or vectors (especially two-, three-, or four-dimensional vectors) were easy to translate to a GPU, which acts with native speed and support on those types. A significant milestone for GPGPU was the year 2003, when two research groups independently discovered GPU-based approaches for solving general linear algebra problems that ran faster than on CPUs. [7] [8] These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, OpenGL and DirectX. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as Sh/RapidMind, Brook and Accelerator. [9] [10] [11]

These were followed by Nvidia's CUDA, which allowed programmers to ignore the underlying graphical concepts in favor of more common high-performance computing concepts. [12] Newer, hardware-vendor-independent offerings include Microsoft's DirectCompute and Apple/Khronos Group's OpenCL. [12] This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form.

Mark Harris, the founder of GPGPU.org, coined the term GPGPU.

Implementations

Any language that allows the code running on the CPU to poll a GPU shader for return values can create a GPGPU framework. Programming standards for parallel computing include OpenCL (vendor-independent), OpenACC, OpenMP and OpenHMPP.

As of 2016, OpenCL is the dominant open general-purpose GPU computing language, and is an open standard defined by the Khronos Group. OpenCL provides a cross-platform GPGPU platform that additionally supports data-parallel compute on CPUs. OpenCL is actively supported on Intel, AMD, Nvidia, and ARM platforms. The Khronos Group has also standardised and implemented SYCL, a higher-level programming model for OpenCL as a single-source domain-specific embedded language based on pure C++11.

The dominant proprietary framework is Nvidia CUDA. [13] Nvidia launched CUDA in 2006, a software development kit (SDK) and application programming interface (API) that allows the programming language C to be used to code algorithms for execution on GeForce 8 series and later GPUs.
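
As a minimal sketch of the programming model (an illustrative example, not taken from the article, with hypothetical names throughout), a CUDA program writes the per-element computation as a kernel and launches it over a grid of threads:

#include <cuda_runtime.h>
#include <cstdio>

// Kernel: each thread computes one element of the output.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMemcpy is also common.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);                // expected: 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}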

ROCm, launched in 2016, is AMD's open-source response to CUDA. As of 2022, it is roughly on par with CUDA in features, but still lags in consumer support.

OpenVIDIA was developed at the University of Toronto between 2003 and 2005, [14] in collaboration with Nvidia.

Altimesh Hybridizer, created by Altimesh, compiles Common Intermediate Language to CUDA binaries. [15] [16] It supports generics and virtual functions. [17] Debugging and profiling are integrated with Visual Studio and Nsight. [18] It is available as a Visual Studio extension on Visual Studio Marketplace.

Microsoft introduced the DirectCompute GPU computing API, released with the DirectX 11 API.

Alea GPU, [19] created by QuantAlea, [20] introduces native GPU computing capabilities for the Microsoft .NET languages F# [21] and C#. Alea GPU also provides a simplified GPU programming model based on GPU parallel-for and parallel aggregate using delegates and automatic memory management. [22]

MATLAB supports GPGPU acceleration using the Parallel Computing Toolbox and MATLAB Distributed Computing Server, [23] and third-party packages like Jacket.

GPGPU processing is also used to simulate Newtonian physics by physics engines, [24] and commercial implementations include Havok Physics/FX and PhysX, both of which are typically used for computer and video games.

C++ Accelerated Massive Parallelism (C++ AMP) is a library that accelerates execution of C++ code by exploiting the data-parallel hardware on GPUs.

Mobile computers

Due to the increasing power of mobile GPUs, general-purpose programming has also become available on mobile devices running major mobile operating systems.

Google Android 4.2 enabled running RenderScript code on the mobile device's GPU. [25] RenderScript has since been deprecated, first in favour of OpenGL compute shaders [26] and later of Vulkan Compute. [27] OpenCL is available on many Android devices, but is not officially supported by Android. Apple introduced the proprietary Metal API for iOS applications, able to execute arbitrary code through Apple's GPU compute shaders.

Hardware support

Computer video cards are produced by various vendors, such as Nvidia and AMD. Cards from different vendors differ in the data formats they support, such as integer and floating-point formats (32-bit and 64-bit). Microsoft introduced a Shader Model standard to help rank the various features of graphics cards into a simple Shader Model version number (1.0, 2.0, 3.0, etc.).

Integer numbers

Pre-DirectX 9 video cards only supported paletted or integer color types. Sometimes another alpha value is added, to be used for transparency. Common formats are:

Floating-point numbers

For early fixed-function or limited-programmability graphics (i.e., up to and including DirectX 8.1-compliant GPUs) this was sufficient, because this is also the representation used in displays. This representation does have certain limitations, however. Given sufficient graphics processing power, even graphics programmers would like to use better formats, such as floating-point data formats, to obtain effects such as high-dynamic-range imaging. Many GPGPU applications require floating-point accuracy, which came with video cards conforming to the DirectX 9 specification.

DirectX 9 Shader Model 2.x suggested the support of two precision types: full and partial precision. Full precision support could either be FP32 or FP24 (floating point 32- or 24-bit per component) or greater, while partial precision was FP16. ATI's Radeon R300 series of GPUs supported FP24 precision only in the programmable fragment pipeline (although FP32 was supported in the vertex processors) while Nvidia's NV30 series supported both FP16 and FP32; other vendors such as S3 Graphics and XGI supported a mixture of formats up to FP24.

The implementations of floating point on Nvidia GPUs are mostly IEEE compliant; however, this is not true across all vendors. [28] This has implications for correctness which are considered important to some scientific applications. While 64-bit floating point values (double precision float) are commonly available on CPUs, these are not universally supported on GPUs. Some GPU architectures sacrifice IEEE compliance, while others lack double-precision. Efforts have occurred to emulate double-precision floating point values on GPUs; however, the speed tradeoff negates any benefit to offloading the computing onto the GPU in the first place. [29]

Vectorization

Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once. For example, if one color R1, G1, B1 is to be modulated by another color R2, G2, B2, the GPU can produce the resulting color R1*R2, G1*G2, B1*B2 in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional). Examples include vertices, colors, normal vectors, and texture coordinates. Many other applications can put this to good use, and because of their higher performance, vector instructions, termed single instruction, multiple data (SIMD), have long been available on CPUs.
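
As an illustrative sketch (assuming CUDA's built-in float4 vector type; this helper is not from the original article), the color-modulation example above becomes a single component-wise operation:

// Hypothetical helper: component-wise modulation of two RGBA colors.
// float4 and make_float4 are CUDA built-ins; on vector hardware the four
// multiplies map naturally onto SIMD lanes.
__device__ float4 modulate(float4 c1, float4 c2) {
    return make_float4(c1.x * c2.x, c1.y * c2.y, c1.z * c2.z, c1.w * c2.w);
}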

GPU vs. CPU

Originally, data was simply passed one-way from a central processing unit (CPU) to a graphics processing unit (GPU), then to a display device. As time progressed, however, it became valuable for GPUs to store at first simple, then complex structures of data to be passed back to the CPU, which analyzed an image or a set of scientific data represented in a 2D or 3D format that a video card can understand. Because the GPU has access to every draw operation, it can analyze data in these forms quickly; a CPU, by contrast, must poll every pixel or data element much more slowly, because access between the CPU and its larger pool of random-access memory (or, worse, a hard drive) is slower than for GPUs and video cards, which typically contain smaller amounts of more expensive, much faster memory. Transferring the portion of the data set to be actively analyzed into GPU memory, in the form of textures or other easily readable GPU forms, results in a speed increase. The distinguishing feature of a GPGPU design is the ability to transfer information bidirectionally, back from the GPU to the CPU; ideally the data throughput in both directions is high, resulting in a multiplier effect on the speed of a specific high-use algorithm.

GPGPU pipelines may improve efficiency on especially large data sets and/or data containing 2D or 3D imagery. They are used in complex graphics pipelines as well as in scientific computing, particularly in fields with large data sets such as genome mapping, or where two- or three-dimensional analysis is useful, as in biomolecule analysis, protein study, and other complex organic chemistry. An example of such an application is Nvidia's software suite for genome analysis.

Such pipelines can also vastly improve efficiency in image processing and computer vision, among other fields; as well as parallel processing generally. Some very heavily optimized pipelines have yielded speed increases of several hundred times the original CPU-based pipeline on one high-use task.

A simple example would be a GPU program that collects data about average lighting values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use edge detection to return both numerical information and a processed image representing outlines to a computer vision program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every pixel or other picture element in an image, it can analyze and average it (for the first example) or apply a Sobel edge filter or other convolution filter (for the second) with much greater speed than a CPU, which typically must access slower random-access memory copies of the graphic in question.

GPGPU is fundamentally a software concept, not a hardware concept; it is a type of algorithm, not a piece of equipment. Specialized equipment designs may, however, further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. Massively parallelized, gigantic-data-level tasks may thus be parallelized even further via specialized setups such as rack computing (many similar, highly tailored machines built into a rack), which adds a third layer: many computing units each using many CPUs to correspond to many GPUs. Some Bitcoin "miners" used such setups for high-quantity processing.

Caches

Historically, CPUs have used hardware-managed caches, whereas earlier GPUs provided only software-managed local memories. However, as GPUs are increasingly used for general-purpose applications, state-of-the-art GPUs are designed with hardware-managed multi-level caches, which have helped move GPUs towards mainstream computing. For example, GeForce 200 series GT200 architecture GPUs did not feature an L2 cache, while the Fermi GPU has 768 KiB of last-level cache, the Kepler GPU has 1.5 MiB, [30] the Maxwell GPU has 2 MiB, and the Pascal GPU has 4 MiB.

Register file

GPUs have very large register files, which allow them to reduce context-switching latency. Register file size is also increasing across GPU generations; e.g., the total register file sizes on Maxwell (GM200), Pascal, and Volta GPUs are 6 MiB, 14 MiB, and 20 MiB, respectively. [31] [32] By comparison, a CPU register file is small, typically tens or hundreds of kilobytes.

Energy efficiency

The high performance of GPUs comes at the cost of high power consumption; under full load, a GPU can draw as much power as the rest of the PC system combined. [33] The maximum power consumption of the Pascal series GPU (Tesla P100) was specified to be 250 W. [34]

Classical GPGPU

Before CUDA was released in 2007, GPGPU was "classical" and involved repurposing graphics primitives. A typical computation was structured as follows:

  1. Load arrays into textures
  2. Draw a quadrangle
  3. Apply pixel shaders and textures to quadrangle
  4. Read out pixel values in the quadrangle as array

More examples are available in part 4 of GPU Gems 2. [35]

Linear algebra

The use of GPUs for numerical linear algebra began at least as early as 2001. [36] It has since been applied to Gauss–Seidel solvers, conjugate gradient methods, and more. [37]

Stream processing

GPUs are designed specifically for graphics and thus are very restrictive in operations and programming. Due to their design, GPUs are only effective for problems that can be solved using stream processing and the hardware can only be used in certain ways.

The following discussion referring to vertices, fragments and textures concerns mainly the legacy model of GPGPU programming, where graphics APIs (OpenGL or DirectX) were used to perform general-purpose computation. With the introduction of the CUDA (Nvidia, 2007) and OpenCL (vendor-independent, 2008) general-purpose computing APIs, in new GPGPU codes it is no longer necessary to map the computation to graphics primitives. The stream processing nature of GPUs remains valid regardless of the APIs used. (See e.g., [38] )

GPUs can only process independent vertices and fragments, but can process many of them in parallel. This is especially effective when the programmer wants to process many vertices or fragments in the same way. In this sense, GPUs are stream processors: processors that can operate in parallel by running one kernel on many records in a stream at once.

A stream is simply a set of records that require similar computation. Streams provide data parallelism. Kernels are the functions that are applied to each element in the stream. In GPUs, vertices and fragments are the elements in streams, and vertex and fragment shaders are the kernels to be run on them. For each element we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.

Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity; otherwise, memory access latency will limit the computational speedup. [39]
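
As a worked illustration (an added example, not from the original text), arithmetic intensity can be written as

\[
I \;=\; \frac{\text{arithmetic operations performed}}{\text{words of memory transferred}} .
\]

For the SAXPY operation \( y_i \leftarrow a\,x_i + y_i \), each element costs 2 operations (one multiply, one add) against 3 words moved (read \(x_i\), read \(y_i\), write \(y_i\)), so \( I = 2/3 \). Such low intensity means the kernel is memory-bound, and offloading it to a GPU yields little speedup.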

Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.

GPU programming concepts

Computational resources

There are a variety of computational resources available on the GPU:

  • Programmable processors – vertex, primitive, fragment and mainly compute pipelines allow the programmer to run kernels on streams of data
  • Rasterizer – creates fragments and interpolates per-vertex constants such as texture coordinates and color
  • Texture unit – read-only memory interface
  • Framebuffer – write-only memory interface

In fact, a program can substitute a write-only texture for output instead of the framebuffer. This is done either through Render to Texture (RTT), Render-To-Backbuffer-Copy-To-Texture (RTBCTT), or the more recent stream-out.

Textures as stream

The most common form for a stream to take in GPGPU is a 2D grid because this fits naturally with the rendering model built into GPUs. Many computations naturally map into grids: matrix algebra, image processing, physically based simulation, and so on.

Since textures are used as memory, texture lookups are then used as memory reads. Certain operations can be done automatically by the GPU because of this.

Kernels

Compute kernels can be thought of as the body of loops. For example, a programmer operating on a grid on the CPU might have code that looks like this:

// Input and output grids have 10000 x 10000 or 100 million elements.
void transform_10k_by_10k_grid(float in[10000][10000], float out[10000][10000]) {
    for (int x = 0; x < 10000; x++) {
        for (int y = 0; y < 10000; y++) {
            // The next line is executed 100 million times
            out[x][y] = do_some_hard_work(in[x][y]);
        }
    }
}

On the GPU, the programmer only specifies the body of the loop as the kernel and what data to loop over by invoking geometry processing.
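
For comparison, here is a hedged CUDA sketch of the same computation: the loop body becomes the kernel, and the launch configuration takes the place of the loops (do_some_hard_work is assumed to be available as a __device__ function):

__global__ void transform_grid(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = do_some_hard_work(in[y * w + x]);  // the old loop body
}

// Launch: a 2D grid of threads stands in for the two nested loops.
// dim3 threads(16, 16);
// dim3 blocks((10000 + threads.x - 1) / threads.x, (10000 + threads.y - 1) / threads.y);
// transform_grid<<<blocks, threads>>>(d_in, d_out, 10000, 10000);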

Flow control

In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs. [40] Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible.

Recent GPUs allow branching, but usually with a performance penalty. Branching should generally be avoided in inner loops, whether in CPU or GPU code, and various methods, such as static branch resolution, pre-computation, predication, loop splitting, [41] and Z-cull [42] can be used to achieve branching when hardware support does not exist.
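
As an illustrative sketch of one of these methods, predication, a data-dependent branch can be replaced with arithmetic selection so that every thread executes the same instruction sequence (hypothetical helper functions, not from the original article):

// Branchy form: threads within a warp may diverge on the condition.
__device__ float branchy(float x, float a, float b) {
    if (x > 0.0f) return a * x;
    else          return b * x;
}

// Predicated form: both coefficients are considered and the result is
// selected arithmetically, so all threads follow one instruction path.
__device__ float predicated(float x, float a, float b) {
    float p = (x > 0.0f) ? 1.0f : 0.0f;   // predicate encoded as 0 or 1
    return (p * a + (1.0f - p) * b) * x;
}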

GPU methods

Map

The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant (increasing the brightness of an image). The map operation is simple to implement on the GPU. The programmer generates a fragment for each pixel on screen and applies a fragment program to each one. The result stream of the same size is stored in the output buffer.
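
In a modern compute API the map operation is a one-line kernel. A minimal CUDA sketch of the brightness example (names are illustrative):

// Map: multiply every element of the stream by a constant gain.
__global__ void brighten(float* pixels, int n, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pixels[i] *= gain;   // the same kernel applied to every element
}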

Reduce

Some computations require calculating a smaller stream (possibly a stream of only one element) from a larger stream. This is called a reduction of the stream. Generally, a reduction can be performed in multiple steps. The results from the prior step are used as the input for the current step and the range over which the operation is applied is reduced until only one stream element remains.
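
A hedged CUDA sketch of one such step, a block-level sum using shared memory: each block collapses its chunk of the input to a single partial sum, and the kernel is relaunched on the partial sums until one element remains. (Launch with shared memory sized to blockDim.x floats; this is the textbook tree reduction, not a tuned implementation.)

__global__ void reduce_sum(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid] = (i < n) ? in[i] : 0.0f;      // load, padding with the identity
    __syncthreads();
    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0]; // one partial sum per block
}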

Stream filtering

Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criteria.

Scan

The scan operation, also termed parallel prefix sum, takes in a vector (stream) of data elements and an (arbitrary) associative binary function '+' with an identity element 'i'. If the input is [a0, a1, a2, a3, ...], an exclusive scan produces the output [i, a0, a0 + a1, a0 + a1 + a2, ...], while an inclusive scan produces the output [a0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, ...] and does not require an identity to exist. While at first glance the operation may seem inherently serial, efficient parallel scan algorithms are possible and have been implemented on graphics processing units. The scan operation has uses in e.g., quicksort and sparse matrix-vector multiplication. [38] [43] [44] [45]
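
A minimal single-block sketch of the Hillis–Steele inclusive scan (assumptions: n fits in one thread block and shared memory is sized to n floats; production scans are work-efficient and multi-block):

__global__ void inclusive_scan(float* data, int n) {
    extern __shared__ float tmp[];
    int tid = threadIdx.x;
    if (tid < n) tmp[tid] = data[tid];
    __syncthreads();
    // At each step, every element adds in the value 'offset' slots to its left.
    for (int offset = 1; offset < n; offset *= 2) {
        float val = 0.0f;
        if (tid >= offset && tid < n) val = tmp[tid - offset];
        __syncthreads();
        if (tid < n) tmp[tid] += val;
        __syncthreads();
    }
    if (tid < n) data[tid] = tmp[tid];
}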

Scatter

The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of the vertex, which allows the programmer to control where information is deposited on the grid. Other extensions are also possible, such as controlling how large an area the vertex affects.

The fragment processor cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the programmer. However, a logical scatter operation may sometimes be recast or implemented with another gather step. A scatter implementation would first emit both an output value and an output address. An immediately following gather operation uses address comparisons to see whether the output value maps to the current output slot.

In dedicated compute kernels, scatter can be performed by indexed writes.

Gather

Gather is the reverse of scatter. After scatter reorders elements according to a map, gather can restore the order of the elements according to the map scatter used. In dedicated compute kernels, gather may be performed by indexed reads. In other shaders, it is performed with texture-lookups.
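
In compute kernels, the two operations reduce to indexed writes and indexed reads. A hedged sketch (the map is assumed to be a permutation, so scatter writes do not collide):

// Scatter: each thread writes its value to the location chosen by the map.
__global__ void scatter(const float* src, const int* map, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[map[i]] = src[i];   // indexed write
}

// Gather: each thread reads from the location chosen by the map.
__global__ void gather(const float* src, const int* map, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[map[i]];   // indexed read
}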

Sort

The sort operation transforms an unordered set of elements into an ordered set of elements. The most common implementation on GPUs is using radix sort for integer and floating point data and coarse-grained merge sort and fine-grained sorting networks for general comparable data. [46] [47]

Search

The search operation allows the programmer to find a given element within the stream, or possibly find neighbors of a specified element. The search method most commonly used is binary search on sorted elements.
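
The parallelism comes from running many searches at once rather than from accelerating a single search. A hedged CUDA sketch in which each thread binary-searches the same sorted array for its own query:

// Each thread performs an independent binary search over a sorted array.
__global__ void batch_search(const float* sorted, int n,
                             const float* queries, int* results, int m) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= m) return;
    float key = queries[q];
    int lo = 0, hi = n - 1, found = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] == key) { found = mid; break; }
        if (sorted[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    results[q] = found;   // index of the key, or -1 if absent
}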

Data structures

A variety of data structures can be represented on the GPU:

Applications

The following are some of the areas where GPUs have been used for general purpose computing:

Bioinformatics

GPGPU usage in Bioinformatics: [62] [86]

Application | Description | Supported features | Expected speed-up† | GPU‡ | Multi-GPU support | Release status
BarraCUDA | DNA, including epigenetics, sequence mapping software [87] | Alignment of short sequencing reads | 6–10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 0.7.107f
CUDASW++ | Open source software for Smith-Waterman protein database searches on GPUs | Parallel search of Smith-Waterman database | 10–50x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.0.8
CUSHAW | Parallelized short read aligner | Parallel, accurate long read aligner: gapped alignments to large genomes | 10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 1.0.40
GPU-BLAST | Local search with fast k-tuple heuristic | Protein alignment according to blastp, multi CPU threads | 3–4x | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 2.2.26
GPU-HMMER | Parallelized local and global search with profile hidden Markov models | Parallel local and global search of hidden Markov models | 60–100x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.3.2
mCUDA-MEME | Ultrafast scalable motif discovery algorithm based on MEME | Scalable motif discovery algorithm based on MEME | 4–10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 3.0.12
SeqNFind | A GPU accelerated sequence analysis toolset | Reference assembly, blast, Smith–Waterman, hmm, de novo assembly | 400x | T 2075, 2090, K10, K20, K20X | Yes | Available now
UGENE | Open-source Smith–Waterman for SSE/CUDA, suffix array based repeats finder and dotplot | Fast short read alignment | 6–8x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 1.11
WideLM | Fits numerous linear models to a fixed design and response | Parallel linear regression on multiple similarly-shaped models | 150x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 0.1-1

Molecular dynamics

Application | Description | Supported features | Expected speed-up† | GPU‡ | Multi-GPU support | Release status
Abalone | Models molecular dynamics of biopolymers for simulations of proteins, DNA and ligands | Explicit and implicit solvent, hybrid Monte Carlo | 4–120x | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 1.8.88
ACEMD | GPU simulation of molecular mechanics force fields, implicit and explicit solvent | Written for use on GPUs | 160 ns/day, GPU version only | T 2075, 2090, K10, K20, K20X | Yes | Available now
AMBER | Suite of programs to simulate molecular dynamics on biomolecules | PMEMD: explicit and implicit solvent | 89.44 ns/day JAC NVE | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 12 + bugfix9
DL-POLY | Simulate macromolecules, polymers, ionic systems, etc. on a distributed memory parallel computer | Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV | 4x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 4.0 source only
CHARMM | MD package to simulate molecular dynamics on biomolecules | Implicit (5x), explicit (2x) solvent via OpenMM | TBD | T 2075, 2090, K10, K20, K20X | Yes | In development Q4/12
GROMACS | Simulate biochemical molecules with complex bond interactions | Implicit (5x), explicit (2x) solvent | 165 ns/day DHFR | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 4.6 in Q4/12
HOOMD-Blue | Particle dynamics package written from the ground up for GPUs | Written for GPUs | 2x | T 2075, 2090, K10, K20, K20X | Yes | Available now
LAMMPS | Classical molecular dynamics package | Lennard-Jones, Morse, Buckingham, CHARMM, tabulated, coarse grain SDK, anisotropic Gay–Berne, RE-squared, "hybrid" combinations | 3–18x | T 2075, 2090, K10, K20, K20X | Yes | Available now
NAMD | Designed for high-performance simulation of large molecular systems | 100M atom capable | 6.44 ns/day STMV 585x 2050s | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.9
OpenMM | Library and application for molecular dynamics for HPC with GPUs | Implicit and explicit solvent, custom forces | Implicit: 127–213 ns/day; explicit: 18–55 ns/day DHFR | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 4.1.1

† Expected speedups are highly dependent on system configuration. GPU performance compared against multi-core x86 CPU socket. GPU performance benchmarked on GPU supported features and may be a kernel to kernel performance comparison. For details on configuration used, view application website. Speedups as per Nvidia in-house testing or ISV's documentation.

‡ Q=Quadro GPU, T=Tesla GPU. Nvidia recommended GPUs for this application. Check with developer or ISV to obtain certification information.

See also


References

  1. Fung, James; Tang, Felix; Mann, Steve (7–10 October 2002). Mediated Reality Using Computer Graphics Hardware for Computer Vision (PDF). Proceedings of the International Symposium on Wearable Computing 2002 (ISWC2002). Seattle, Washington, USA. pp. 83–89. Archived from the original (PDF) on 2 April 2012.
  2. Aimone, Chris; Fung, James; Mann, Steve (2003). "An Eye Tap video-based featureless projective motion estimation assisted by gyroscopic tracking for wearable computer mediated reality". Personal and Ubiquitous Computing. 7 (5): 236–248. doi:10.1007/s00779-003-0239-6. S2CID   25168728.
  3. "Computer Vision Signal Processing on Graphics Processing Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004) Archived 19 August 2011 at the Wayback Machine : Montreal, Quebec, Canada, 17–21 May 2004, pp. V-93 – V-96
  4. Chitty, D. M. (2007, July). A data parallel approach to genetic programming using programmable graphics hardware Archived 8 August 2017 at the Wayback Machine . In Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 1566-1573). ACM.
  5. "Using Multiple Graphics Cards as a General Purpose Parallel Computer: Applications to Computer Vision", Proceedings of the 17th International Conference on Pattern Recognition (ICPR2004) Archived 18 July 2011 at the Wayback Machine , Cambridge, United Kingdom, 23–26 August 2004, volume 1, pages 805–808.
  6. Hull, Gerald (December 1987). "LIFE". Amazing Computing. 2 (12): 81–84.
  7. Krüger, Jens; Westermann, Rüdiger (July 2003). "Linear algebra operators for GPU implementation of numerical algorithms". ACM Transactions on Graphics. 22 (3): 908–916. doi:10.1145/882262.882363. ISSN   0730-0301.
  8. Bolz, Jeff; Farmer, Ian; Grinspun, Eitan; Schröder, Peter (July 2003). "Sparse matrix solvers on the GPU: conjugate gradients and multigrid". ACM Transactions on Graphics. 22 (3): 917–924. doi:10.1145/882262.882364. ISSN   0730-0301.
  9. Tarditi, David; Puri, Sidd; Oglesby, Jose (2006). "Accelerator: using data parallelism to program GPUs for general-purpose uses" (PDF). ACM SIGARCH Computer Architecture News. 34 (5). doi:10.1145/1168919.1168898.
  10. Che, Shuai; Boyer, Michael; Meng, Jiayuan; Tarjan, D.; Sheaffer, Jeremy W.; Skadron, Kevin (2008). "A performance study of general-purpose applications on graphics processors using CUDA". J. Parallel and Distributed Computing. 68 (10): 1370–1380. CiteSeerX   10.1.1.143.4849 . doi:10.1016/j.jpdc.2008.05.014.
  11. Glaser, J.; Nguyen, T. D.; Anderson, J. A.; Lui, P.; Spiga, F.; Millan, J. A.; Morse, D. C.; Glotzer, S. C. (2015). "Strong scaling of general-purpose molecular dynamics simulations on GPUs". Computer Physics Communications. 192: 97–107. arXiv: 1412.3387 . Bibcode:2015CoPhC.192...97G. doi: 10.1016/j.cpc.2015.02.028 .
  12. Du, Peng; Weber, Rick; Luszczek, Piotr; Tomov, Stanimire; Peterson, Gregory; Dongarra, Jack (2012). "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming". Parallel Computing. 38 (8): 391–407. CiteSeerX 10.1.1.193.7712. doi:10.1016/j.parco.2011.10.002.
  13. "OpenCL Gains Ground on CUDA". 28 February 2012. Archived from the original on 23 April 2012. Retrieved 10 April 2012. "As the two major programming frameworks for GPU computing, OpenCL and CUDA have been competing for mindshare in the developer community for the past few years."
  14. James Fung, Steve Mann, Chris Aimone, "OpenVIDIA: Parallel GPU Computer Vision Archived 23 December 2019 at the Wayback Machine ", Proceedings of the ACM Multimedia 2005, Singapore, 6–11 November 2005, pages 849–852
  15. "Hybridizer". Hybridizer. Archived from the original on 17 October 2017.
  16. "Home page". Altimesh. Archived from the original on 17 October 2017.
  17. "Hybridizer generics and inheritance". 27 July 2017. Archived from the original on 17 October 2017.
  18. "Debugging and Profiling with Hybridizer". 5 June 2017. Archived from the original on 17 October 2017.
  19. "Introduction". Alea GPU. Archived from the original on 25 December 2016. Retrieved 15 December 2016.
  20. "Home page". Quant Alea. Archived from the original on 12 December 2016. Retrieved 15 December 2016.
  21. "Use F# for GPU Programming". F# Software Foundation. Archived from the original on 18 December 2016. Retrieved 15 December 2016.
  22. "Alea GPU Features". Quant Alea. Archived from the original on 21 December 2016. Retrieved 15 December 2016.
  23. "MATLAB Adds GPGPU Support". 20 September 2010. Archived from the original on 27 September 2010.
  24. Joselli, Mark, et al. "A new physics engine with automatic process distribution between CPU-GPU." Proceedings of the 2008 ACM SIGGRAPH symposium on Video games. ACM, 2008.
  25. "Android 4.2 APIs - Android Developers". developer.android.com. Archived from the original on 26 August 2013.
  26. "Migrate scripts to OpenGL ES 3.1".
  27. "Migrate scripts to Vulkan".
  28. Harris, Mark (2005). "Mapping computational concepts to GPUs". ACM SIGGRAPH 2005 Courses on - SIGGRAPH '05. pp. 50–es. doi:10.1145/1198555.1198768. ISBN   9781450378338. S2CID   8212423.
  29. Double precision on GPUs (Proceedings of ASIM 2005) Archived 21 August 2014 at the Wayback Machine : Dominik Goddeke, Robert Strzodka, and Stefan Turek. Accelerating Double Precision (FEM) Simulations with (GPUs). Proceedings of ASIM 2005  18th Symposium on Simulation Technique, 2005.
  30. "Nvidia-Kepler-GK110-Architecture-Whitepaper" (PDF). Archived (PDF) from the original on 21 February 2015.
  31. "Inside Pascal: Nvidia’s Newest Computing Platform Archived 7 May 2017 at the Wayback Machine "
  32. "Inside Volta: The World’s Most Advanced Data Center GPU Archived 1 January 2020 at the Wayback Machine "
  33. "How Much Power Does Your Graphics Card Need?". Tom's Hardware. https://www.tomshardware.com/reviews/geforce-radeon-power,2122.html
  34. "Nvidia Tesla P100 GPU Accelerator" (PDF). Nvidia. https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf Archived 24 July 2018 at the Wayback Machine.
  35. Pharr, Matt, ed. (2006). "Part IV: General-Purpose Computation on GPUS: A Primer". GPU gems 2: programming techniques for high-performance graphics and general-purpose computation (3. print ed.). Upper Saddle River, NJ Munich: Addison-Wesley. ISBN   978-0-321-33559-3.
  36. Larsen, E. Scott; McAllister, David (10 November 2001). "Fast matrix multiplies using graphics hardware". ACM: 55. doi:10.1145/582034.582089. ISBN 978-1-58113-293-9.
  37. Krüger, Jens; Westermann, Rüdiger (2005). "Linear algebra operators for GPU implementation of numerical algorithms". ACM Press: 234. doi:10.1145/1198555.1198795.
  38. "D. Göddeke, 2010. Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters. Ph.D. dissertation, Technischen Universität Dortmund". Archived from the original on 16 December 2014.
  39. Asanovic, K.; Bodik, R.; Demmel, J.; Keaveny, T.; Keutzer, K.; Kubiatowicz, J.; Morgan, N.; Patterson, D.; Sen, K.; Wawrzynek, J.; Wessel, D.; Yelick, K. (2009). "A view of the parallel computing landscape". Commun. ACM. 52 (10): 56–67. doi: 10.1145/1562764.1562783 .
  40. "GPU Gems – Chapter 34, GPU Flow-Control Idioms".
  41. Future Chips. "Tutorial on removing branches", 2011
  42. GPGPU survey paper Archived 4 January 2007 at the Wayback Machine : John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Tim Purcell. "A Survey of General-Purpose Computation on Graphics Hardware". Computer Graphics Forum, volume 26, number 1, 2007, pp. 80–113.
  43. "S. Sengupta, M. Harris, Y. Zhang, J. D. Owens, 2007. Scan primitives for GPU computing. In T. Aila and M. Segal (eds.): Graphics Hardware (2007)". Archived from the original on 5 June 2015. Retrieved 16 December 2014.
  44. Blelloch, G. E. (1989). "Scans as primitive parallel operations" (PDF). IEEE Transactions on Computers. 38 (11): 1526–1538. doi:10.1109/12.42122. Archived from the original (PDF) on 23 September 2015. Retrieved 16 December 2014.
  45. "M. Harris, S. Sengupta, J. D. Owens. Parallel Prefix Sum (Scan) with CUDA. In Nvidia: GPU Gems 3, Chapter 39".
  46. Merrill, Duane. Allocation-oriented Algorithm Design with Application to GPU Computing. Ph.D. dissertation, Department of Computer Science, University of Virginia. Dec. 2011.
  47. Sean Baxter. Modern gpu Archived 7 October 2016 at the Wayback Machine , 2013.
  48. Leung, Alan, Ondřej Lhoták, and Ghulam Lashari. "Automatic parallelization for graphics processing units." Proceedings of the 7th International Conference on Principles and Practice of Programming in Java. ACM, 2009.
  49. Henriksen, Troels, Martin Elsman, and Cosmin E. Oancea. "Size slicing: a hybrid approach to size inference in futhark." Proceedings of the 3rd ACM SIGPLAN workshop on Functional high-performance computing. ACM, 2014.
  50. Baskaran, Muthu Manikandan; Bondhugula, Uday; Krishnamoorthy, Sriram; Ramanujam, J.; Rountev, Atanas; Sadayappan, P. (2008). "A compiler framework for optimization of affine loop nests for gpgpus". Proceedings of the 22nd annual international conference on Supercomputing - ICS '08. p. 225. doi:10.1145/1375527.1375562. ISBN   9781605581583. S2CID   6137960.
  51. "K. Crane, I. Llamas, S. Tariq, 2008. Real-Time Simulation and Rendering of 3D Fluids. In Nvidia: GPU Gems 3, Chapter 30".
  52. "M. Harris, 2004. Fast Fluid Dynamics Simulation on the GPU. In Nvidia: GPU Gems, Chapter 38". Archived from the original on 7 October 2017.
  53. Block, Benjamin; Virnau, Peter; Preis, Tobias (2010). "Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model". Computer Physics Communications. 181 (9): 1549–1556. arXiv: 1007.3726 . Bibcode:2010CoPhC.181.1549B. doi:10.1016/j.cpc.2010.05.005. S2CID   14828005.
  54. Sun, S.; Bauer, C.; Beichel, R. (2011). "Automated 3-D Segmentation of Lungs with Lung Cancer in CT Data Using a Novel Robust Active Shape Model Approach". IEEE Transactions on Medical Imaging. 31 (2): 449–460. doi:10.1109/TMI.2011.2171357. PMC   3657761 . PMID   21997248.
  55. Jimenez, Edward S., and Laurel J. Orr. "Rethinking the union of computed tomography reconstruction and GPGPU computing." Penetrating Radiation Systems and Applications XIV. Vol. 8854. International Society for Optics and Photonics, 2013.
  56. Sorensen, T.S.; Schaeffter, T.; Noe, K.O.; Hansen, M.S. (2008). "Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware". IEEE Transactions on Medical Imaging. 27 (4): 538–547. doi:10.1109/TMI.2007.909834. PMID   18390350. S2CID   206747049.
  57. Garcia, Vincent; Debreuve, Eric; Barlaud, Michel (2008). "Fast k Nearest Neighbor Search using GPU". arXiv: 0804.1448 [cs.CV].
  58. Cococcioni, Marco; Grasso, Raffaele; Rixen, Michel (2011). "Rapid prototyping of high performance fuzzy computing applications using high level GPU programming for maritime operations support". 2011 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA). pp. 17–23. doi:10.1109/CISDA.2011.5945947. ISBN   978-1-4244-9939-7. S2CID   2089441.
  59. Whalen, Sean (10 March 2005). Audio and the Graphics Processing Unit. CiteSeerX   10.1.1.114.365 .
  60. Wilson, Ron (3 September 2009). "DSP brings you a high-definition moon walk". EDN. Archived from the original on 22 January 2013. Retrieved 3 September 2009. Lowry is reportedly using Nvidia Tesla GPUs (graphics-processing units) programmed in the company's CUDA (Compute Unified Device Architecture) to implement the algorithms. Nvidia claims that the GPUs are approximately two orders of magnitude faster than CPU computations, reducing the processing time to less than one minute per frame.
  61. Alerstam, E.; Svensson, T.; Andersson-Engels, S. (2008). "Parallel computing with graphics processing units for high speed Monte Carlo simulation of photon migration" (PDF). Journal of Biomedical Optics. 13 (6): 060504. Bibcode:2008JBO....13f0504A. doi: 10.1117/1.3041496 . PMID   19123645. Archived (PDF) from the original on 9 August 2011.
  62. Hasan, Khondker S.; Chatterjee, Amlan; Radhakrishnan, Sridhar; Antonio, John K. (2014). "Performance Prediction Model and Analysis for Compute-Intensive Tasks on GPUs" (PDF). Advanced Information Systems Engineering. Lecture Notes in Computer Science. Vol. 7908. pp. 612–617. doi:10.1007/978-3-662-44917-2_65. ISBN 978-3-642-38708-1.
  63. "Computational Physics with GPUs: Lund Observatory". www.astro.lu.se. Archived from the original on 12 July 2010.
  64. Schatz, Michael C; Trapnell, Cole; Delcher, Arthur L; Varshney, Amitabh (2007). "High-throughput sequence alignment using Graphics Processing Units". BMC Bioinformatics. 8: 474. doi: 10.1186/1471-2105-8-474 . PMC   2222658 . PMID   18070356.
  65. Svetlin A. Manavski; Giorgio Valle (2008). "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment". BMC Bioinformatics. 9 (Suppl. 2): S10. doi: 10.1186/1471-2105-9-s2-s10 . PMC   2323659 . PMID   18387198.
  66. Olejnik, M; Steuwer, M; Gorlatch, S; Heider, D (15 November 2014). "gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing". Bioinformatics. 30 (22): 3272–3. doi: 10.1093/bioinformatics/btu535 . PMID   25123901.
  67. Wang, Guohui, et al. "Accelerating computer vision algorithms using OpenCL framework on the mobile GPU-a case study." 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.
  68. Boyer, Vincent; El Baz, Didier (2013). "Recent Advances on GPU Computing in Operations Research". 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (PDF). pp. 1778–1787. doi:10.1109/IPDPSW.2013.45. ISBN   978-0-7695-4979-8. S2CID   2774188.
  69. Bukata, Libor; Sucha, Premysl; Hanzalek, Zdenek (2014). "Solving the Resource Constrained Project Scheduling Problem using the parallel Tabu Search designed for the CUDA platform". Journal of Parallel and Distributed Computing. 77: 58–68. arXiv: 1711.04556 . doi:10.1016/j.jpdc.2014.11.005. S2CID   206391585.
  70. Bäumelt, Zdeněk; Dvořák, Jan; Šůcha, Přemysl; Hanzálek, Zdeněk (2016). "A Novel Approach for Nurse Rerostering based on a Parallel Algorithm". European Journal of Operational Research . 251 (2): 624–639. doi:10.1016/j.ejor.2015.11.022.
  71. CTU-IIG Archived 9 January 2016 at the Wayback Machine Czech Technical University in Prague, Industrial Informatics Group (2015).
  72. NRRPGpu Archived 9 January 2016 at the Wayback Machine Czech Technical University in Prague, Industrial Informatics Group (2015).
  73. Naju Mancheril. "GPU-based Sorting in PostgreSQL" (PDF). School of Computer Science – Carnegie Mellon University. Archived (PDF) from the original on 2 August 2011.
  74. Manavski, Svetlin A. "CUDA compatible GPU as an efficient hardware accelerator for AES cryptography Archived 7 May 2019 at the Wayback Machine ." 2007 IEEE International Conference on Signal Processing and Communications. IEEE, 2007.
  75. Harrison, Owen; Waldron, John (2007). "AES Encryption Implementation and Analysis on Commodity Graphics Processing Units". Cryptographic Hardware and Embedded Systems - CHES 2007. Lecture Notes in Computer Science. Vol. 4727. p. 209. CiteSeerX   10.1.1.149.7643 . doi:10.1007/978-3-540-74735-2_15. ISBN   978-3-540-74734-5.
  76. AES and modes of operations on SM4.0 compliant GPUs. Archived 21 August 2010 at the Wayback Machine Owen Harrison, John Waldron, Practical Symmetric Key Cryptography on Modern Graphics Hardware. In proceedings of USENIX Security 2008.
  77. Harrison, Owen; Waldron, John (2009). "Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware". Progress in Cryptology – AFRICACRYPT 2009. Lecture Notes in Computer Science. Vol. 5580. p. 350. CiteSeerX   10.1.1.155.5448 . doi:10.1007/978-3-642-02384-2_22. ISBN   978-3-642-02383-5.
  78. "Teraflop Troubles: The Power of Graphics Processing Units May Threaten the World's Password Security System". Georgia Tech Research Institute. Archived from the original on 30 December 2010. Retrieved 7 November 2010.
  79. "Want to deter hackers? Make your password longer". NBC News . 19 August 2010. Archived from the original on 11 July 2013. Retrieved 7 November 2010.
  80. Lerner, Larry (9 April 2009). "Viewpoint: Mass GPUs, not CPUs for EDA simulations". EE Times. Retrieved 14 September 2023.
  81. "W2500 ADS Transient Convolution GT". accelerates signal integrity simulations on workstations that have Nvidia Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPU)
  82. GrAVity: A Massively Parallel Antivirus Engine Archived 27 July 2010 at the Wayback Machine . Giorgos Vasiliadis and Sotiris Ioannidis, GrAVity: A Massively Parallel Antivirus Engine. In proceedings of RAID 2010.
  83. "Kaspersky Lab utilizes Nvidia technologies to enhance protection". Kaspersky Lab. 14 December 2009. Archived from the original on 19 June 2010. During internal testing, the Tesla S1070 demonstrated a 360-fold increase in the speed of the similarity-defining algorithm when compared to the popular Intel Core 2 Duo central processor running at a clock speed of 2.6 GHz.
  84. Gnort: High Performance Network Intrusion Detection Using Graphics Processors Archived 9 April 2011 at the Wayback Machine . Giorgos Vasiliadis et al., Gnort: High Performance Network Intrusion Detection Using Graphics Processors. In proceedings of RAID 2008.
  85. Regular Expression Matching on Graphics Hardware for Intrusion Detection Archived 27 July 2010 at the Wayback Machine . Giorgos Vasiliadis et al., Regular Expression Matching on Graphics Hardware for Intrusion Detection. In proceedings of RAID 2009.
  86. "GPU-Accelerated Applications" (PDF). Archived (PDF) from the original on 25 March 2013. Retrieved 12 September 2013.
  87. Langdon, William B; Lam, Brian Yee Hong; Petke, Justyna; Harman, Mark (2015). "Improving CUDA DNA Analysis Software with Genetic Programming". Proceedings of the 2015 on Genetic and Evolutionary Computation Conference - GECCO '15. pp. 1063–1070. doi:10.1145/2739480.2754652. ISBN   9781450334723. S2CID   8992769.

Further reading