BrookGPU

BrookGPU
Developer(s)	Stanford University
Stable release	v0.5 Beta 1 / 2007;17 years ago
Repository	svn.code.sf.net/p/brook/code/ ;
Operating system	Linux, Windows
Type	Compiler/runtime
License	BSD license (parts are under the GPL)
Website	http://graphics.stanford.edu/projects/brookgpu/

Last updated March 19, 2024

The Brook programming language and its implementation BrookGPU were early and influential attempts to enable general-purpose computing on graphics processing units.^[1]^[2] Brook, developed at Stanford University graphics group, was a compiler and runtime implementation of a stream programming language targeting modern, highly parallel GPUs such as those found on ATI or Nvidia graphics cards.

BrookGPU compiled programs written using the Brook stream programming language, which is a variant of ANSI C. It could target OpenGL v1.3+, DirectX v9+ or AMD's Close to Metal for the computational backend and ran on both Microsoft Windows and Linux. For debugging, BrookGPU could also simulate a virtual graphics card on the CPU.

Status

The last major beta release (v0.4) was in October 2004 but renewed development began and stopped again in November 2007 with a v0.5 beta 1 release.

The new features of v0.5 include a much upgraded and faster OpenGL backend which uses framebuffer objects instead of PBuffers and harmonised the code around standard OpenGL interfaces instead of using proprietary vendor extensions. GLSL support was added which brings all the functionality (complex branching and loops) previously only supported by DirectX 9 to OpenGL. In particular, this means that Brook is now just as capable on Linux as Windows.

Other improvements in the v0.5 series include multi-backend usage whereby different threads can run different Brook programs concurrently (thus maximising use of a multi-GPU setup) and SSE and OpenMP support for the CPU backend (this allows near maximal usage of modern CPUs).

Performance comparison

A like for like comparison between desktop CPUs and GPGPUs is problematic because of algorithmic & structural differences.

For example, a 2.66 GHz Intel Core 2 Duo can perform a maximum of 25 GFLOPs (25 billion single-precision floating-point operations per second) if optimally using SSE and streaming memory access so the prefetcher works perfectly. However, traditionally (due to shader program length limits) most GPGPU kernels tend to perform relatively small amounts of work on large amounts of data in parallel, so the big problem with directly executing GPGPU algorithms on desktop CPUs is vastly lower memory bandwidth as generally speaking the CPU spends most of its time waiting on RAM. As an example, dual-channel PC2-6400 DDR2 RAM can throughput about 11 Gbit/s which is around 1.5 GFLOPs maximum given that there is a total of 3 GFLOPs total bandwidth and one must both read and write. As a result, if memory bandwidth constrained, Brook's CPU backend won't exceed 2 GFLOPs. In practice, it's even lower than that most especially for anything other than float4 which is the only data type which can be SSE accelerated.

On an ATI HD 2900 XT (740 MHz core 1000 MHz memory), Brook can perform a maximum of 410 GFLOPs via its DirectX 9 backend. OpenGL is currently (due to driver and Cg compiler limitations) much less efficient as a GPGPU backend on that GPU, so Brook can only manage 210 GFLOPs when using OpenGL on that GPU. On paper, this looks like around twenty times faster than the CPU, but as just explained it isn't as easy as that. GPUs currently have major branch and read/write access penalties so expect a reasonable maximum of one third of the peak maximum in real world code - this still leaves that ATI card at around 125 GFLOPs some five times faster than the Intel Core 2 Duo.

However this discounts the important part of transferring the data to be processed to and from the GPU. With a PCI Express 1.0 x8 interface, the memory of an ATI HD 2900 XT can be written to at about 730 Mbit/s and read from at about 311 Mbit/s which is significantly slower than normal PC memory. For large datasets, this can greatly diminish the speed increase of using a GPU over a well-tuned CPU implementation. Of course, as GPUs become faster far more quickly than CPUs and the PCI Express interface improves, it will make more sense to offload large processing to GPUs.

Applications and games that use BrookGPU

Folding@home

Related Research Articles

<span class="mw-page-title-main">Graphics card</span> Expansion card which generates a feed of output images to a display device

A graphics card is a computer expansion card that generates a feed of graphics output to a display device such as a monitor. Graphics cards are sometimes called discrete or dedicated graphics cards to emphasize their distinction to integrated graphics processor on the motherboard or the CPU. A graphics processing unit (GPU) that performs the necessary computations is the main component in a graphics card, but the acronym "GPU" is sometimes also used to erroneously refer to the graphics card as a whole.

A graphics processing unit (GPU) is a specialized electronic circuit initially designed to accelerate computer graphics and image processing. After their initial design, GPUs were found to be useful for non-graphic calculations involving embarrassingly parallel problems due to their parallel structure. Other non-graphical uses include the training of neural networks and cryptocurrency mining.

Radeon is a brand of computer products, including graphics processing units, random-access memory, RAM disk software, and solid-state drives, produced by Radeon Technologies Group, a division of AMD. The brand was launched in 2000 by ATI Technologies, which was acquired by AMD in 2006 for US$5.4 billion.

Core Image is a pixel-accurate, near-realtime, non-destructive image processing technology in Mac OS X. Implemented as part of the QuartzCore framework of Mac OS X 10.4 and later, Core Image provides a plugin-based architecture for applying filters and effects within the Quartz graphics rendering layer. The framework was later added to iOS in iOS 5.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

A physics processing unit (PPU) is a dedicated microprocessor designed to handle the calculations of physics, especially in the physics engine of video games. It is an example of hardware acceleration.

The Radeon Xpress 200 is a computer chipset released by ATI. The chipset supports AMD 64-bit processors as well as supporting Intel Pentium 4, Pentium D and Celeron processors. Additionally, it includes support for DDR400 RAM and DDR-2 667 RAM on the Intel Edition.

The Xenos is a custom graphics processing unit (GPU) designed by ATI, used in the Xbox 360 video game console developed and produced for Microsoft. Developed under the codename "C1", it is in many ways related to the R520 architecture and therefore very similar to an ATI Radeon X1800 XT series of PC graphics cards as far as features and performance are concerned. However, the Xenos introduced new design ideas that were later adopted in the TeraScale microarchitecture, such as the unified shader architecture. The package contains two separate dies, the GPU and an eDRAM, featuring a total of 337 million transistors.

CUDA is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.

In computing, Close To Metal is the name of a beta version of a low-level programming interface developed by ATI, now the AMD Graphics Product Group, aimed at enabling GPGPU computing. CTM was short-lived, and the first production version of AMD's GPGPU technology is now called AMD Stream SDK, or rather the current AMD APP SDK for Windows and Linux 32-bit and 64-bit. APP stands for "Accelerated Parallel Processing" and also targets Heterogeneous System Architecture.

Unified Video Decoder is the name given to AMD's dedicated video decoding ASIC. There are multiple versions implementing a multitude of video codecs, such as H.264 and VC-1.

<span class="mw-page-title-main">Perl OpenGL</span>

Perl OpenGL (POGL) is a portable, compiled wrapper library that allows OpenGL to be used in the Perl programming language.

Larrabee is the codename for a cancelled GPGPU chip that Intel was developing separately from its current line of integrated graphics accelerators. It is named after either Mount Larrabee or Larrabee State Park in Whatcom County, Washington, near the town of Bellingham. The chip was to be released in 2010 as the core of a consumer 3D graphics card, but these plans were cancelled due to delays and disappointing early performance figures. The project to produce a GPU retail product directly from the Larrabee research project was terminated in May 2010 and its technology was passed on to the Xeon Phi. The Intel MIC multiprocessor architecture announced in 2010 inherited many design elements from the Larrabee project, but does not function as a graphics processing unit; the product is intended as a co-processor for high performance computing.

AMD FireStream was AMD's brand name for their Radeon-based product line targeting stream processing and/or GPGPU in supercomputers. Originally developed by ATI Technologies around the Radeon X1900 XTX in 2006, the product line was previously branded as both ATI FireSTREAM and AMD Stream Processor. The AMD FireStream can also be used as a floating-point co-processor for offloading CPU calculations, which is part of the Torrenza initiative. The FireStream line has been discontinued since 2012, when GPGPU workloads were entirely folded into the AMD FirePro line.

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

The Northern Islands series is a family of GPUs developed by Advanced Micro Devices (AMD) forming part of its Radeon-brand, based on the 40 nm process. Some models are based on TeraScale 2 (VLIW5), some on the new TeraScale 3 (VLIW4) introduced with them.

The graphics processing unit (GPU) codenamed the Radeon R600 is the foundation of the Radeon HD 2000/3000 series and the FireGL 2007 series video cards developed by ATI Technologies.

The graphics processing unit (GPU) codenamed Radeon R600 is the foundation of the Radeon HD 2000 series and the FireGL 2007 series video cards developed by ATI Technologies. The HD 2000 cards competed with nVidia's GeForce 8 series.

AMD Instinct is AMD's brand of professional GPUs. It replaced AMD's FirePro S brand in 2016. Compared to the Radeon brand of mainstream consumer/gamer products, the instinct product line is intended to accelerate deep learning, artificial neural network, and high-performance computing/GPGPU applications.

ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), heterogeneous computing. It offers several programming models: HIP, OpenMP/Message Passing Interface (MPI), and OpenCL.

References

↑ Tarditi, David; Puri, Sidd; Oglesby, Jose (2006). "Accelerator: using data parallelism to program GPUs for general-purpose uses" (PDF). ACM SIGARCH Computer Architecture News. 34 (5). doi:10.1145/1168919.1168898.
↑ Che, Shuai; Boyer, Michael; Meng, Jiayuan; Tarjan, D.; Sheaffer, Jeremy W.; Skadron, Kevin (2008). "A performance study of general-purpose applications on graphics processors using CUDA". J. Parallel and Distributed Computing. 68 (10): 1370–1380. doi:10.1016/j.jpdc.2008.05.014.

External links

Official website , Stanford University

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Tarditi, David; Puri, Sidd; Oglesby, Jose (2006). "Accelerator: using data parallelism to program GPUs for general-purpose uses" (PDF). ACM SIGARCH Computer Architecture News. 34 (5). doi:10.1145/1168919.1168898.

[2] Che, Shuai; Boyer, Michael; Meng, Jiayuan; Tarjan, D.; Sheaffer, Jeremy W.; Skadron, Kevin (2008). "A performance study of general-purpose applications on graphics processors using CUDA". J. Parallel and Distributed Computing. 68 (10): 1370–1380. doi:10.1016/j.jpdc.2008.05.014.

[1]

[2]