OpenCL

OpenCL API
Original author(s): Apple Inc.
Developer(s): Khronos Group
Initial release: August 28, 2009
Stable release: 3.0.17 [1] / October 24, 2024
Written in: C, with C++ bindings
Operating system: Android (vendor dependent), [2] FreeBSD, [3] Linux, macOS (via Pocl), Windows
Platform: ARMv7, ARMv8, [4] Cell, IA-32, Power, x86-64
Type: Heterogeneous computing API
License: OpenCL specification license
Website: www.khronos.org/opencl/

OpenCL C/C++ and C++ for OpenCL
Paradigm: Imperative (procedural), structured, (C++ only) object-oriented, generic programming
Family: C
Stable releases: OpenCL C++ 1.0 revision V2.2–11, [5] OpenCL C 3.0 revision V3.0.11, [6] C++ for OpenCL 1.0 and 2021 [7] / December 20, 2021
Typing discipline: Static, weak, manifest, nominal
Implementation language: Implementation specific
Filename extensions: .cl, .clcpp
Website: www.khronos.org/opencl
Major implementations: AMD, Gallium Compute, IBM, Intel NEO, Intel SDK, Texas Instruments, Nvidia, POCL, Arm
Influenced by: C99, CUDA, C++14, C++17

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies a programming language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.


OpenCL is an open standard maintained by the Khronos Group, a non-profit, open standards organisation. Conformant implementations (passed the Conformance Test Suite) are available from a range of companies including AMD, ARM, Cadence, Google, Imagination, Intel, Nvidia, Qualcomm, Samsung, SPI and Verisilicon. [8] [9]

Overview

OpenCL views a computing system as consisting of a number of compute devices, which might be central processing units (CPUs) or "accelerators" such as graphics processing units (GPUs), attached to a host processor (a CPU). It defines a C-like language for writing programs. Functions executed on an OpenCL device are called "kernels". [10] :17 A single compute device typically consists of several compute units, which in turn comprise multiple processing elements (PEs). A single kernel execution can run on all or many of the PEs in parallel. How a compute device is subdivided into compute units and PEs is up to the vendor; a compute unit can be thought of as a "core", but the notion of core is hard to define across all the types of devices supported by OpenCL (or even within the category of "CPUs"), [11] :49–50 and the number of compute units may not correspond to the number of cores claimed in vendors' marketing literature (which may actually be counting SIMD lanes). [12]

In addition to its C-like programming language, OpenCL defines an application programming interface (API) that allows programs running on the host to launch kernels on the compute devices and manage device memory, which is (at least conceptually) separate from host memory. Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices. [13] The OpenCL standard defines host APIs for C and C++; third-party APIs exist for other programming languages and platforms such as Python, [14] Java, Perl, [15] D [16] and .NET. [11] :15 An implementation of the OpenCL standard consists of a library that implements the API for C and C++, and an OpenCL C compiler for the compute devices targeted.

In order to open the OpenCL programming model to other languages or to protect the kernel source from inspection, the Standard Portable Intermediate Representation (SPIR) [17] can be used as a target-independent way to ship kernels between a front-end compiler and the OpenCL back-end.

More recently, the Khronos Group has ratified SYCL, [18] a higher-level programming model for OpenCL: a single-source embedded DSL based on pure C++17 that improves programming productivity. Developers who want C++ kernels but not the SYCL single-source programming style can instead use C++ features in compute-kernel sources written in the "C++ for OpenCL" language. [19]

Memory hierarchy

OpenCL defines a four-level memory hierarchy for the compute device: [13]

  • global memory: shared by all processing elements, but with high access latency;
  • read-only memory: smaller, lower latency, writable by the host CPU but not the compute devices;
  • local memory: shared by a group of processing elements;
  • per-element private memory (registers).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.

Devices may or may not share memory with the host CPU. [13] The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.

OpenCL kernel language

The programming language used to write compute kernels is called the kernel language. OpenCL adopts C/C++-based languages to specify the kernel computations performed on the device, with some restrictions and additions to facilitate efficient mapping to the heterogeneous hardware resources of accelerators. Traditionally, OpenCL C was used to program accelerators in the OpenCL standard; the C++ for OpenCL kernel language was developed later, inheriting all functionality from OpenCL C while allowing the use of C++ features in kernel sources.

OpenCL C language

OpenCL C [20] is a C99-based language dialect adapted to fit the device model in OpenCL. Memory buffers reside in specific levels of the memory hierarchy, and pointers are annotated with the region qualifiers __global, __local, __constant, and __private, reflecting this. Instead of a device program having a main function, OpenCL C functions are marked __kernel to signal that they are entry points into the program to be called from the host program. Function pointers, bit fields and variable-length arrays are omitted, and recursion is forbidden. [21] The C standard library is replaced by a custom set of standard functions, geared toward math programming.

OpenCL C is extended to facilitate use of parallelism with vector types and operations, synchronization, and functions to work with work-items and work-groups. [21] In particular, besides scalar types such as float and double, which behave similarly to the corresponding types in C, OpenCL provides fixed-length vector types such as float4 (a 4-vector of single-precision floats); such vector types are available in lengths two, three, four, eight and sixteen for various base types. [20] :§ 6.1.2 Vectorized operations on these types are intended to map onto SIMD instruction sets, e.g., SSE or VMX, when running OpenCL programs on CPUs. [13] Other specialized types include 2-d and 3-d image types. [20] :10–11

Example: matrix–vector multiplication

Each invocation (work-item) of the kernel takes a row of the green matrix (A in the code), multiplies this row with the red vector (x) and places the result in an entry of the blue vector (y). The number of columns n is passed to the kernel as ncols; the number of rows is implicit in the number of work-items produced by the host program.

The following is a matrix–vector multiplication algorithm in OpenCL C.

// Multiplies A*x, leaving the result in y.
// A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
__kernel void matvec(__global const float *A, __global const float *x,
                     uint ncols, __global float *y)
{
    size_t i = get_global_id(0);            // Global id, used as the row index
    __global float const *a = &A[i*ncols];  // Pointer to the i'th row
    float sum = 0.f;                        // Accumulator for dot product
    for (size_t j = 0; j < ncols; j++) {
        sum += a[j] * x[j];
    }
    y[i] = sum;
}

The kernel function matvec computes, in each invocation, the dot product of a single row of the matrix A and the vector x, storing the result in the corresponding entry of y.

To extend this into a full matrix–vector multiplication, the OpenCL runtime maps the kernel over the rows of the matrix. On the host side, the clEnqueueNDRangeKernel function does this; it takes as arguments the kernel to execute, its arguments, and a number of work-items, corresponding to the number of rows in the matrix A.

Example: computing the FFT

This example will load a fast Fourier transform (FFT) implementation and execute it. The implementation is shown below. [22] The code asks the OpenCL library for the first available graphics card, creates memory buffers for reading and writing (from the perspective of the graphics card), JIT-compiles the FFT-kernel and then finally asynchronously runs the kernel. The result from the transform is not read in this example.

#include <stdio.h>
#include <time.h>
#include "CL/opencl.h"

#define NUM_ENTRIES 1024

int main() // (int argc, const char* argv[])
{
    // CONSTANTS
    // The source code of the kernel is represented as a string
    // located inside file: "fft1D_1024_kernel_src.cl". For the details see the next listing.
    const char *KernelSource =
        #include "fft1D_1024_kernel_src.cl"
        ;

    // Looking up the available GPUs
    const cl_uint num = 1;
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 0, NULL, (cl_uint*)&num);

    cl_device_id devices[1];
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, num, devices, NULL);

    // create a compute context with GPU device
    cl_context context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // create a command queue
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_DEFAULT, 1, devices, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // allocate the buffer memory objects
    cl_mem memobjs[] = {
        clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                       sizeof(float) * 2 * NUM_ENTRIES, NULL, NULL),
        clCreateBuffer(context, CL_MEM_READ_WRITE,
                       sizeof(float) * 2 * NUM_ENTRIES, NULL, NULL)
    };

    // create the compute program
    // const char* fft1D_1024_kernel_src[1] = {  };
    cl_program program = clCreateProgramWithSource(context, 1, (const char **)&KernelSource, NULL, NULL);

    // build the compute program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the compute kernel
    cl_kernel kernel = clCreateKernel(program, "fft1D_1024", NULL);

    // set the args values
    size_t local_work_size[1] = { 256 };
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    clSetKernelArg(kernel, 2, sizeof(float) * (local_work_size[0] + 1) * 16, NULL);
    clSetKernelArg(kernel, 3, sizeof(float) * (local_work_size[0] + 1) * 16, NULL);

    // create N-D range object with work-item dimensions and execute kernel
    size_t global_work_size[1] = { 256 };
    global_work_size[0] = NUM_ENTRIES;
    local_work_size[0] = 64; // Nvidia: 192 or 256
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
}

The actual calculation inside file "fft1D_1024_kernel_src.cl" (based on "Fitting FFT onto the G80 Architecture"): [23]

R"(
// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);              // coalesced global reads
    fftRadix16Pass(data);                   // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);                   // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4);     // twiddle factor multiplication
    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);                    // radix-4 function number 1
    fftRadix4Pass(data + 4);                // radix-4 function number 2
    fftRadix4Pass(data + 8);                // radix-4 function number 3
    fftRadix4Pass(data + 12);               // radix-4 function number 4

    // coalesced global writes
    globalStores(data, out, 64);
})"

A full, open source implementation of an OpenCL FFT can be found on Apple's website. [24]

C++ for OpenCL language

In 2020, Khronos announced [25] the transition to the community-driven C++ for OpenCL programming language, [26] which provides features from C++17 in combination with the traditional OpenCL C features. The language makes it possible to leverage a rich variety of features from standard C++ while preserving backward compatibility with OpenCL C. This opens up a smooth transition path to C++ functionality for OpenCL kernel-code developers, who can continue using a familiar programming flow and tools, as well as existing extensions and libraries available for OpenCL C.

The language semantics are described in the documentation published in the releases of the OpenCL-Docs [27] repository hosted by the Khronos Group, but the language is currently not ratified by the Khronos Group. C++ for OpenCL is not documented in a stand-alone specification; it is based on the specifications of C++ and OpenCL C. The open source Clang compiler has supported C++ for OpenCL since release 9. [28]

C++ for OpenCL was originally developed as a Clang compiler extension and appeared in release 9. [29] As it was tightly coupled with OpenCL C and did not contain any Clang-specific functionality, its documentation was re-hosted in the OpenCL-Docs repository [27] of the Khronos Group, along with the sources of other specifications and reference cards. The first official release of the document describing C++ for OpenCL version 1.0 was published in December 2020. [30] C++ for OpenCL 1.0 contains features from C++17 and is backward compatible with OpenCL C 2.0. In December 2021, a new provisional C++ for OpenCL version 2021 was released, which is fully compatible with the OpenCL 3.0 standard. [31] A work-in-progress draft of the latest C++ for OpenCL documentation can be found on the Khronos website. [32]

Features

C++ for OpenCL supports most of the features (syntactically and semantically) of OpenCL C, except for nested parallelism and blocks. [33] However, there are minor differences in some supported features, mainly related to differences in semantics between C++ and C. For example, C++ is stricter with implicit type conversions, and it does not support the restrict type qualifier. [33] The following C++ features are not supported by C++ for OpenCL: virtual functions, the dynamic_cast operator, non-placement new/delete operators, exceptions, pointers to member functions, references to functions, and the C++ standard libraries. [33] C++ for OpenCL extends the concept of separate memory regions (address spaces) from OpenCL C to C++ features: functional casts, templates, class members, references, lambda functions, and operators. Most C++ features are not available for kernel functions, e.g. overloading or templating, or arbitrary class layout in parameter types. [33]

Example: complex-number arithmetic

The following code snippet illustrates how kernels with complex-number arithmetic can be implemented in C++ for OpenCL language with convenient use of C++ features.

// Define a class Complex, that can perform complex-number computations with
// various precision when different types for T are used - double, float, half.
template<typename T>
class complex_t {
    T m_re; // Real component.
    T m_im; // Imaginary component.

public:
    complex_t(T re, T im) : m_re{re}, m_im{im} {};

    // Define operator for complex-number multiplication.
    complex_t operator*(const complex_t &other) const
    {
        return {m_re * other.m_re - m_im * other.m_im,
                m_re * other.m_im + m_im * other.m_re};
    }
    T get_re() const { return m_re; }
    T get_im() const { return m_im; }
};

// A helper function to compute multiplication over complex numbers read from
// the input buffer and to store the computed result into the output buffer.
template<typename T>
void compute_helper(__global T *in, __global T *out)
{
    auto idx = get_global_id(0);
    // Every work-item uses 4 consecutive items from the input buffer
    // - two for each complex number.
    auto offset = idx * 4;
    auto num1 = complex_t{in[offset], in[offset + 1]};
    auto num2 = complex_t{in[offset + 2], in[offset + 3]};
    // Perform complex-number multiplication.
    auto res = num1 * num2;
    // Every work-item writes 2 consecutive items to the output buffer.
    out[idx * 2] = res.get_re();
    out[idx * 2 + 1] = res.get_im();
}

// This kernel is used for complex-number multiplication in single precision.
__kernel void compute_sp(__global float *in, __global float *out)
{
    compute_helper(in, out);
}

#ifdef cl_khr_fp16
// This kernel is used for complex-number multiplication in half precision when
// it is supported by the device.
#pragma OPENCL EXTENSION cl_khr_fp16: enable
__kernel void compute_hp(__global half *in, __global half *out)
{
    compute_helper(in, out);
}
#endif

Tooling and execution environment

The C++ for OpenCL language can be used for the same applications or libraries, and in the same way, as the OpenCL C language. Thanks to the rich variety of C++ language features, applications written in C++ for OpenCL can express complex functionality more conveniently than applications written in OpenCL C; in particular, the generic programming paradigm of C++ is very attractive to library developers.

C++ for OpenCL sources can be compiled by OpenCL drivers that support the cl_ext_cxx_for_opencl extension. [34] Arm announced support for this extension in December 2020. [35] However, due to the increasing complexity of the algorithms accelerated on OpenCL devices, it is expected that more applications will compile C++ for OpenCL kernels offline, using standalone compilers such as Clang, [36] into an executable binary format or a portable binary format such as SPIR-V. [37] Such an executable can be loaded during OpenCL application execution using a dedicated OpenCL API. [38]

Binaries compiled from sources in C++ for OpenCL 1.0 can be executed on OpenCL 2.0 conformant devices. Depending on the language features used in such kernel sources, they can also be executed on devices supporting earlier OpenCL versions or OpenCL 3.0.

Aside from use with OpenCL drivers, kernels written in C++ for OpenCL can be compiled for execution on Vulkan devices using the clspv [39] compiler and the clvk [40] runtime layer, in just the same way as OpenCL C kernels.

Contributions

C++ for OpenCL is an open language developed by the community of contributors listed in its documentation. [32] New contributions to the language's semantic definition or to open source tooling support are accepted from anyone interested, as long as they are aligned with the main design philosophy and are reviewed and approved by the experienced contributors. [19]

History

OpenCL was initially developed by Apple Inc., which holds trademark rights, and refined into an initial proposal in collaboration with technical teams at AMD, IBM, Qualcomm, Intel, and Nvidia. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed [41] with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details of the specification for OpenCL 1.0 by November 18, 2008. [42] This technical specification was reviewed by the Khronos members and approved for public release on December 8, 2008. [43]

OpenCL 1.0

OpenCL 1.0 was released with Mac OS X Snow Leopard on August 28, 2009. According to an Apple press release: [44]

Snow Leopard further extends support for modern hardware with Open Computing Language (OpenCL), which lets any application tap into the vast gigaflops of GPU computing power previously available only to graphics applications. OpenCL is based on the C programming language and has been proposed as an open standard.

AMD decided to support OpenCL instead of the now deprecated Close to Metal in its Stream framework. [45] [46] RapidMind announced their adoption of OpenCL underneath their development platform to support GPUs from multiple vendors with one interface. [47] On December 9, 2008, Nvidia announced its intention to add full support for the OpenCL 1.0 specification to its GPU Computing Toolkit. [48] On October 30, 2009, IBM released its first OpenCL implementation as a part of the XL compilers. [49]

With OpenCL on graphics cards, calculations can be accelerated by factors of up to 1000 relative to a normal CPU. [50] Some important features of the next version of OpenCL, such as double- or half-precision operations, are optional in 1.0. [51]

OpenCL 1.1

OpenCL 1.1 was ratified by the Khronos Group on June 14, 2010, [52] and adds significant functionality for enhanced parallel programming flexibility, functionality, and performance including:

  • New data types including 3-component vectors and additional image formats;
  • Handling commands from multiple host threads and processing buffers across multiple devices;
  • Operations on regions of a buffer including read, write and copy of 1D, 2D, or 3D rectangular regions;
  • Enhanced use of events to drive and control command execution;
  • Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies;
  • Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

OpenCL 1.2

On November 15, 2011, the Khronos Group announced the OpenCL 1.2 specification, [53] which added significant functionality over the previous versions in terms of performance and features for parallel programming. Most notable features include:

  • Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units;
  • Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs;
  • Enhanced image support: 1D images and 1D/2D image arrays;
  • Built-in kernels: custom devices that contain specific unique functionality, such as video encoders and FPGAs, are integrated more closely into the OpenCL framework;
  • DirectX functionality: DX9 media surface sharing and DX11 surface sharing.

OpenCL 2.0

On November 18, 2013, the Khronos Group announced the ratification and public release of the finalized OpenCL 2.0 specification. [55] Updates and additions to OpenCL 2.0 include:

  • Shared virtual memory;
  • Nested parallelism (device-side kernel enqueue);
  • Generic address space;
  • Images, including 3D image writes and sRGB images;
  • C11 atomics;
  • Pipes;
  • Android installable client driver extension.

OpenCL 2.1

The ratification and release of the OpenCL 2.1 provisional specification was announced on March 3, 2015, at the Game Developer Conference in San Francisco. It was released on November 16, 2015. [56] It introduced the OpenCL C++ kernel language, based on a subset of C++14, while maintaining support for the preexisting OpenCL C kernel language. Vulkan and OpenCL 2.1 share SPIR-V as an intermediate representation, allowing high-level language front-ends to share a common compilation target. Updates to the OpenCL API include:

  • Subgroups, which expose hardware threading, moved into core, along with additional subgroup query operations;
  • clCloneKernel, enabling copying of kernel objects and state for safe implementation of copy constructors in wrapper classes;
  • Low-latency device timer queries for aligning profiling data between device and host code.

AMD, ARM, Intel, HPC, and YetiWare have declared support for OpenCL 2.1. [57] [58]

OpenCL 2.2

OpenCL 2.2 brings the OpenCL C++ kernel language into the core specification for significantly enhanced parallel programming productivity. [59] [60] [61] It was released on May 16, 2017. [62] A maintenance update with bug fixes was released in May 2018. [63]

OpenCL 3.0

The OpenCL 3.0 specification was released on September 30, 2020, after being in preview since April 2020. OpenCL 1.2 functionality has become a mandatory baseline, while all OpenCL 2.x and OpenCL 3.0 features were made optional. The specification retains the OpenCL C language and deprecates the OpenCL C++ Kernel Language, replacing it with the C++ for OpenCL language, [19] based on a Clang/LLVM compiler which implements a subset of C++17 and SPIR-V intermediate code. [64] [65] [66] Version 3.0.7 of C++ for OpenCL, with some Khronos OpenCL extensions, was presented at IWOCL 21. [67] The current version is 3.0.11, with some new extensions and corrections. NVIDIA, working closely with the Khronos OpenCL Working Group, improved Vulkan interop with semaphores and memory sharing. [68] The latest minor update was 3.0.14, with bug fixes and a new extension for multiple devices. [69]

Roadmap

The International Workshop on OpenCL (IWOCL) held by the Khronos Group

When releasing OpenCL 2.2, the Khronos Group announced that OpenCL would converge where possible with Vulkan to enable OpenCL software deployment flexibility over both APIs. [70] [71] This has now been demonstrated by Adobe's Premiere Rush, which uses the clspv [39] open source compiler to compile significant amounts of OpenCL C kernel code to run on a Vulkan runtime for deployment on Android. [72] OpenCL also has a forward-looking roadmap independent of Vulkan, with 'OpenCL Next' under development and targeting release in 2020. OpenCL Next may integrate extensions such as Vulkan / OpenCL interop, scratch-pad memory management, extended subgroups, SPIR-V 1.4 ingestion and SPIR-V extended debug info. OpenCL is also considering a Vulkan-like loader and layers, and a "flexible profile" for deployment flexibility on multiple accelerator types. [73]

Open source implementations

clinfo, a command-line tool to see OpenCL information

OpenCL consists of a set of headers and a shared object that is loaded at runtime. An installable client driver (ICD) must be installed on the platform for every vendor class that the runtime needs to support. For example, in order to support Nvidia devices on a Linux platform, the Nvidia ICD must be installed such that the OpenCL runtime (the ICD loader) is able to locate the vendor's ICD and redirect the calls appropriately. The standard OpenCL header is used by the consumer application; calls to each function are then proxied by the OpenCL runtime to the appropriate driver using the ICD. Each vendor must implement each OpenCL call in their driver. [74]

The Apple, [75] Nvidia, [76] ROCm, RapidMind [77] and Gallium3D [78] implementations of OpenCL are all based on the LLVM Compiler technology and use the Clang compiler as their frontend.

MESA Gallium Compute
An implementation of OpenCL (currently an incomplete 1.1, mostly done for AMD Radeon GCN) for a number of platforms is maintained as part of the Gallium Compute Project, [79] which builds on the work of the Mesa project to support multiple platforms. Formerly known as CLOVER, [80] current development mostly consists of support for running the incomplete framework with current LLVM and Clang, plus some new features such as fp16 in 17.3, [81] targeting complete OpenCL 1.0, 1.1 and 1.2 for AMD and Nvidia. New basic development is done by Red Hat with SPIR-V, also for Clover. [82] [83] The new target is a modular OpenCL 3.0 with full support of OpenCL 1.2. The current state is available in Mesamatrix. Image support is the current focus of development.
RustiCL is a new implementation for Gallium compute written in Rust instead of C. In Mesa 22.2, an experimental implementation is available with OpenCL 3.0 support and an image-extension implementation for programs such as Darktable. [84] Intel Xe (Arc) and AMD GCN+ are supported in Mesa 22.3+. AMD R600 and Nvidia Kepler+ are also targets of hardware support. [85] [86] [87] RustiCL outperforms AMD ROCm with Radeon RX 6700 XT hardware in the Luxmark benchmark. [88] Mesa 23.1 supports RustiCL officially. In Mesa 23.2, support for the important fp64 feature is at an experimental level.
Microsoft's Windows 11 on Arm added support for OpenCL 1.2 via CLon12, an open source OpenCL implementation built on top of DirectX 12 via Mesa Gallium. [89] [90] [91]
BEIGNET
An implementation by Intel for its Ivy Bridge and newer hardware, released in 2013. [92] This software, from Intel's China team, attracted criticism from developers at AMD and Red Hat, [93] as well as from Michael Larabel of Phoronix. [94] The current version 1.3.2 supports OpenCL 1.2 completely (Ivy Bridge and higher) and OpenCL 2.0 optionally for Skylake and newer. [95] [96] Support for Android has been added to Beignet. [97] Current development targets only support for 1.2 and 2.0; the road to OpenCL 2.1, 2.2 and 3.0 has moved to NEO.
NEO
An implementation by Intel for Gen 8 Broadwell and Gen 9+ hardware, released in 2018. [98] This driver replaces the Beignet implementation for supported platforms (but not the older generations 6 through Haswell). NEO provides OpenCL 2.1 support on Core platforms and OpenCL 1.2 on Atom platforms. [99] As of 2020, Graphics Gen 11 Ice Lake and Gen 12 Tiger Lake are also supported. New OpenCL 3.0 is available for Alder Lake, and Tiger Lake to Broadwell, with version 20.41+. It now includes the optional OpenCL 2.0 and 2.1 features completely, and some of 2.2.
ROCm
Created as part of AMD's GPUOpen, ROCm (Radeon Open Compute) is an open source Linux project built on OpenCL 1.2 with language support for 2.0. The system is compatible with all modern AMD CPUs and APUs (currently partly GFX 7, GFX 8 and 9), as well as Intel Gen 7.5+ CPUs (only with PCIe 3.0). [100] [101] With version 1.9, support was experimentally extended in some respects to hardware with PCIe 2.0 and without atomics. An overview of current work was given at XDC2018. [102] [103] ROCm version 2.0 supports full OpenCL 2.0, but some errors and limitations remain on the todo list. [104] [105] Version 3.3 improved details. [106] Version 3.5 supports OpenCL 2.2. [107] Version 3.10 came with improvements and new APIs. [108] Announced at SC20, ROCm 4.0 supports the AMD Instinct MI100 compute card. [109] Current documentation for 5.5.1 and earlier is available at GitHub. [110] [111] [112] OpenCL 3.0 is available. ROCm 5.5.x+ supports only GFX 9 Vega and later, so alternatives for older hardware are older ROCm releases or, in the future, RustiCL.
POCL
A portable implementation supporting CPUs and some GPUs (via CUDA and HSA), built on Clang and LLVM. [113] With version 1.0, OpenCL 1.2 was nearly fully implemented, along with some 2.x features. [114] Version 1.2 works with LLVM/Clang 6.0 and 7.0 and has full OpenCL 1.2 support, with all tickets closed in Milestone 1.2. [114] [115] OpenCL 2.0 is nearly fully implemented. [116] Version 1.3 supports Mac OS X. [117] Version 1.4 includes support for LLVM 8.0 and 9.0. [118] Version 1.5 implements LLVM/Clang 10 support. [119] Version 1.6 implements LLVM/Clang 11 support and CUDA acceleration. [120] Current targets are complete OpenCL 2.x and OpenCL 3.0 support, and performance improvements. With manual optimization, POCL 1.6 is at the same level as the Intel compute runtime. [121] Version 1.7 implements LLVM/Clang 12 support and some new OpenCL 3.0 features. [122] Version 1.8 implements LLVM/Clang 13 support. [123] Version 3.0 implements OpenCL 3.0 at the minimum level, with LLVM/Clang 14. [124] Version 3.1 works with LLVM/Clang 15 and has improved SPIR-V support. [125]
Shamrock
A port of Mesa Clover for ARM with full support of OpenCL 1.2; [126] [127] no current development for 2.0.
FreeOCL
A CPU-focused implementation of OpenCL 1.2 that uses an external compiler to create a more reliable platform; [128] no current development.
MOCL
An OpenCL implementation based on POCL, developed by NUDT researchers for Matrix-2000, was released in 2018. The Matrix-2000 architecture is designed to replace the Intel Xeon Phi accelerators of the TianHe-2 supercomputer. This programming framework is built on top of LLVM v5.0 and reuses some code from POCL. To unlock the hardware potential, the device runtime uses a push-based task dispatching strategy, and the performance of kernel atomics is improved significantly. This framework has been deployed on the TH-2A system and is readily available to the public. [129] Some of this software will next be ported to improve POCL. [114]
VC4CL
An OpenCL 1.2 implementation for the VideoCore IV (BCM2763) processor used in the Raspberry Pi before its model 4. [130]

Vendor implementations

Timeline of vendor implementations

Devices

As of 2016, OpenCL runs on graphics processing units (GPUs), CPUs with SIMD instructions, FPGAs, Movidius Myriad 2, Adapteva Epiphany and DSPs.

Khronos Conformance Test Suite

To be officially conformant, an implementation must pass the Khronos Conformance Test Suite (CTS), with results being submitted to the Khronos Adopters Program. [175] The Khronos CTS code for all OpenCL versions has been available in open source since 2017. [176]

Conformant products

The Khronos Group maintains an extended list of OpenCL-conformant products. [4]

Synopsis of OpenCL conformant products [4]
AMD SDKs (supports OpenCL CPU and APU devices; GPU: TeraScale 1: OpenCL 1.1, TeraScale 2: 1.2, GCN 1: 1.2+, GCN 2+: 2.0+)
  Platform: X86 + SSE2 (or higher) compatible CPUs, 64-bit and 32-bit; [177] Linux 2.6 PC, Windows Vista/7/8.x/10 PC
  Devices: AMD Fusion E-350, E-240, C-50, C-30 with HD 6310/HD 6250; AMD Radeon/Mobility HD 6800, HD 5x00 series GPU; iGPU HD 6310/HD 6250; HD 7xxx, HD 8xxx, R2xx, R3xx, RX 4xx, RX 5xx, Vega series; AMD FirePro Vx800 series GPU and later; Radeon Pro

Intel SDK for OpenCL Applications 2013 [178] (supports Intel Core processors and Intel HD Graphics 4000/2500); 2017 R2 with OpenCL 2.1 (Gen7+); SDK 2019 removed OpenCL 2.1; [179] current SDK is 2020 update 3
  Platform: Intel CPUs with SSE 4.1, SSE 4.2 or AVX support; [180] [181] Microsoft Windows, Linux
  Devices: Intel Core i7, i5, i3; 2nd Generation Intel Core i7/5/3; 3rd Generation Intel Core processors with Intel HD Graphics 4000/2500 and newer; Intel Core 2 Solo, Duo, Quad, Extreme and newer; Intel Xeon 7x00, 5x00, 3x00 (Core based) and newer

IBM Servers with OpenCL Development Kit for Linux on Power running on Power VSX [182] [183]
  Devices: IBM Power 775 (PERCS), 750; IBM BladeCenter PS70x Express; IBM BladeCenter JS2x, JS43; IBM BladeCenter QS22

IBM OpenCL Common Runtime (OCR) [184]
  Platform: X86 + SSE2 (or higher) compatible CPUs, 64-bit and 32-bit; [185] Linux 2.6 PC
  Devices: AMD Fusion, Nvidia Ion and Intel Core i7, i5, i3; 2nd Generation Intel Core i7/5/3; AMD Radeon, Nvidia GeForce and Intel Core 2 Solo, Duo, Quad, Extreme; ATI FirePro, Nvidia Quadro and Intel Xeon 7x00, 5x00, 3x00 (Core based)

Nvidia OpenCL Driver and Tools [186]
  Chips: Tesla: OpenCL 1.1 (driver 340); Fermi: OpenCL 1.1 (driver 390); Kepler: OpenCL 1.2 (driver 470), OpenCL 2.0 beta (378.66); OpenCL 3.0: Maxwell to Ada Lovelace (driver 525+)
  Devices: Nvidia Tesla C/D/S; Nvidia GeForce GTS/GT/GTX; Nvidia Ion; Nvidia Quadro FX/NVX/Plex, Quadro, Quadro K, Quadro M, Quadro P, Quadro with Volta, Quadro RTX with Turing, Ampere

All standard-conformant implementations can be queried using one of the clinfo tools (there are multiple tools with the same name and similar feature set). [187] [188] [189]

Version support

Products and their version of OpenCL support include: [190]

OpenCL 3.0 support

OpenCL 3.0 is possible on all hardware with OpenCL 1.2+; the OpenCL 2.x features are optional. The Khronos Test Suite has been available since October 2020. [191] [192]

  • (2020) Intel NEO Compute: 20.41+ for Gen 12 Tiger Lake to Broadwell (includes full 2.0 and 2.1 support and parts of 2.2) [193]
  • (2020) Intel 6th, 7th, 8th, 9th, 10th, 11th gen processors (Skylake, Kaby Lake, Coffee Lake, Comet Lake, Ice Lake, Tiger Lake) with latest Intel Windows graphics driver
  • (2021) Intel 11th, 12th gen processors (Rocket Lake, Alder Lake) with latest Intel Windows graphics driver
  • (2021) Arm Mali-G78, Mali-G310, Mali-G510, Mali-G610, Mali-G710 and Mali-G78AE.
  • (2022) Intel 13th gen processors (Raptor Lake) with latest Intel Windows graphics driver
  • (2022) Intel Arc discrete graphics with latest Intel Arc Windows graphics driver
  • (2021) Nvidia Maxwell, Pascal, Volta, Turing and Ampere with Nvidia graphics driver 465+. [172]
  • (2022) Nvidia Ada Lovelace with Nvidia graphics driver 525+.
  • (2022) Samsung Xclipse 920 GPU (based on AMD RDNA2)
  • (2023) Intel 14th gen processors (Raptor Lake Refresh) with latest Intel Windows graphics driver
  • (2023) Intel Core Ultra Series 1 processors (Meteor Lake) with latest Intel Windows graphics driver

OpenCL 2.2 support

None yet. The Khronos Test Suite is ready; with driver updates, all hardware with 2.0 and 2.1 support could become conformant.

  • Intel NEO Compute: work in progress for current products [194]
  • ROCm: mostly supported since version 3.5

OpenCL 2.1 support

OpenCL 2.0 support

  • (2011+) AMD GCN GPUs (HD 7700+/HD 8000/Rx 200/Rx 300/Rx 400/Rx 500/Rx 5000 series); some 1st-gen GCN parts support only 1.2 with some extensions
  • (2013+) AMD GCN APUs (Jaguar, Steamroller, Puma, Excavator and Zen-based)
  • (2014+) Intel 5th and 6th gen processors (Broadwell, Skylake)
  • (2015+) Qualcomm Adreno 5xx series
  • (2018+) Qualcomm Adreno 6xx series
  • (2017+) ARM Mali (Bifrost) G51 and G71 in Android 7.1 and Linux
  • (2018+) ARM Mali (Bifrost) G31, G52, G72 and G76
  • (2017+) incomplete evaluation support: Nvidia Kepler, Maxwell, Pascal, Volta and Turing GPUs (GeForce 600, 700, 800, 900 and 10 series; Quadro K-, M- and P-series; Tesla K-, M- and P-series) with driver version 378.66+

OpenCL 1.2 support

  • (2011+) some AMD GCN 1st-gen GPUs; some OpenCL 2.0 features are not possible today, but many more extensions than TeraScale
  • (2009+) AMD TeraScale 2 & 3 GPUs (RV8xx, RV9xx in HD 5000, 6000 and 7000 series)
  • (2011+) AMD TeraScale APUs (K10, Bobcat and Piledriver-based)
  • (2012+) Nvidia Kepler, Maxwell, Pascal, Volta and Turing GPUs (GeForce 600, 700, 800, 900, 10, 16, 20 series; Quadro K-, M- and P-series; Tesla K-, M- and P-series)
  • (2012+) Intel 3rd and 4th gen processors (Ivy Bridge, Haswell)
  • (2013+) Qualcomm Adreno 4xx series
  • (2013+) ARM Mali Midgard 3rd gen (T760)
  • (2015+) ARM Mali Midgard 4th gen (T8xx)

OpenCL 1.1 support

  • (2008+) some AMD TeraScale 1 GPUs (RV7xx in HD 4000 series)
  • (2008+) Nvidia Tesla and Fermi GPUs (GeForce 8, 9, 100, 200, 300, 400, 500 series; Quadro and Tesla series with Tesla or Fermi GPUs)
  • (2011+) Qualcomm Adreno 3xx series
  • (2012+) ARM Mali Midgard 1st and 2nd gen (T-6xx, T720)

OpenCL 1.0 support

  • Most devices were updated to 1.1 or 1.2 after an initial driver supporting only 1.0.

Portability, performance and alternatives

A key feature of OpenCL is portability, achieved through its abstracted memory and execution model; the programmer cannot directly use hardware-specific technologies such as inline Parallel Thread Execution (PTX) for Nvidia GPUs without sacrificing portability to other platforms. Any OpenCL kernel can run on any conformant implementation.
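As an illustration of this source-level portability, a minimal OpenCL C kernel such as the SAXPY example below uses only core language features, so the same source can be built and run unchanged by any conformant implementation, whether the device is a CPU, GPU, DSP or FPGA:

```c
// SAXPY in OpenCL C: y = a*x + y, one work-item per element.
// Compiled at run time by the implementation's own compiler,
// so no hardware-specific code appears in the source.
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
```

What is not fixed by the source is how fast this kernel runs on each device, which is the subject of the following paragraphs.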

However, kernel performance is not necessarily portable across platforms. Existing implementations have been shown to be competitive when kernel code is properly tuned, and auto-tuning has been suggested as a solution to the performance portability problem, [195] yielding "acceptable levels of performance" in experimental linear algebra kernels. [196] Portability of an entire application containing multiple kernels with differing behaviors was also studied, showing that portability required only limited tradeoffs. [197]

A study at Delft University from 2011 that compared CUDA programs and their straightforward translation into OpenCL C found CUDA to outperform OpenCL by at most 30% on the Nvidia implementation. The researchers noted that their comparison could be made fairer by applying manual optimizations to the OpenCL programs, in which case there was "no reason for OpenCL to obtain worse performance than CUDA". The performance differences could mostly be attributed to differences in the programming model (especially the memory model) and to NVIDIA's compiler optimizations for CUDA compared to those for OpenCL. [195]

Another study at D-Wave Systems Inc. found that "The OpenCL kernel’s performance is between about 13% and 63% slower, and the end-to-end time is between about 16% and 67% slower" than CUDA's performance. [198]

The fact that OpenCL allows workloads to be shared by CPU and GPU, executing the same programs, means that programmers can exploit both by dividing work among the devices. [199] This leads to the problem of deciding how to partition the work, because the relative speeds of operations differ among the devices. Machine learning has been suggested to solve this problem: Grewe and O'Boyle describe a system of support-vector machines trained on compile-time features of programs that can decide the device partitioning problem statically, without running the programs to measure their performance. [200]

A comparison of current graphics cards from AMD's RDNA 2 and Nvidia's RTX series yielded inconclusive OpenCL test results. Possible performance increases from the use of Nvidia CUDA or OptiX were not tested. [201]

See also

Related Research Articles

OpenGL

OpenGL is a cross-language, cross-platform application programming interface (API) for rendering 2D and 3D vector graphics. The API is typically used to interact with a graphics processing unit (GPU), to achieve hardware-accelerated rendering.

LLVM

LLVM is a set of compiler and toolchain technologies that can be used to develop a frontend for any programming language and a backend for any instruction set architecture. LLVM is designed around a language-independent intermediate representation (IR) that serves as a portable, high-level assembly language that can be optimized with a variety of transformations over multiple passes. The name LLVM originally stood for Low Level Virtual Machine, though the project has expanded and the name is no longer officially an initialism.

The Khronos Group, Inc. is an open, non-profit, member-driven consortium of 170 organizations developing, publishing and maintaining royalty-free interoperability standards for 3D graphics, virtual reality, augmented reality, parallel computation, vision acceleration and machine learning. The open standards and associated conformance tests enable software applications and middleware to effectively harness authoring and accelerated playback of dynamic media across a wide variety of platforms and devices. The group is based in Beaverton, Oregon.

Mesa (computer graphics)

Mesa, also called Mesa3D and The Mesa 3D Graphics Library, is an open source implementation of OpenGL, Vulkan, and other graphics API specifications. Mesa translates these specifications to vendor-specific graphics hardware drivers.

Free and open-source graphics device driver

A free and open-source graphics device driver is a software stack which controls computer-graphics hardware and supports graphics-rendering application programming interfaces (APIs) and is released under a free and open-source software license. Graphics device drivers are written for specific hardware to work within a specific operating system kernel and to support a range of APIs used by applications to access the graphics hardware. They may also control output to the display if the display driver is part of the graphics hardware. Most free and open-source graphics device drivers are developed by the Mesa project. The driver is made up of a compiler, a rendering API, and software which manages access to the graphics hardware.

CUDA

In computing, CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). The CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism and GPU device-specific operations. CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.

nouveau (software)

nouveau is a free and open-source graphics device driver for Nvidia video cards and the Tegra family of SoCs written by independent software engineers, with minor help from Nvidia employees.

AMD FireStream was AMD's brand name for their Radeon-based product line targeting stream processing and/or GPGPU in supercomputers. Originally developed by ATI Technologies around the Radeon X1900 XTX in 2006, the product line was previously branded as both ATI FireSTREAM and AMD Stream Processor. The AMD FireStream can also be used as a floating-point co-processor for offloading CPU calculations, which is part of the Torrenza initiative. The FireStream line has been discontinued since 2012, when GPGPU workloads were entirely folded into the AMD FirePro line.

X-Video Bitstream Acceleration (XvBA), designed by AMD Graphics for its Radeon GPUs and APUs, is an arbitrary extension of the X video extension (Xv) for the X Window System on Linux operating systems. The XvBA API allows video programs to offload portions of the video decoding process to the GPU's video hardware. The portions designed to be offloaded by XvBA onto the GPU are motion compensation (MC), inverse discrete cosine transform (IDCT) and variable-length decoding (VLD) for MPEG-2, MPEG-4 ASP, MPEG-4 AVC (H.264), WMV3 and VC-1 encoded video.

WebCL

WebCL is a JavaScript binding to OpenCL for heterogeneous parallel computing within any compatible web browser without the use of plug-ins, first announced in March 2011. It was developed along similar lines to OpenCL and is considered a browser version of it. WebCL allows web applications to harness multi-core CPUs and GPUs; with the growing popularity of applications that need parallel processing, such as image editing, augmented reality and sophisticated games, improving computational speed has become more important. For these reasons, the non-profit Khronos Group designed and developed WebCL, a JavaScript binding to OpenCL with portable kernel programming, enabling parallel computing in web browsers across a wide range of devices. WebCL consists of two parts: kernel programming, which runs on the processors (devices), and JavaScript, which binds the web application to OpenCL. The completed and ratified specification for WebCL 1.0 was released on March 19, 2014.

OpenACC is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify parallel programming of heterogeneous CPU/GPU systems.

C++ Accelerated Massive Parallelism is a native programming model that contains elements that span the C++ programming language and its runtime library. It provides an easy way to write programs that compile and execute on data-parallel hardware, such as graphics cards (GPUs).

Heterogeneous System Architecture (HSA) is a cross-vendor set of specifications that allow for the integration of central processing units and graphics processors on the same bus, with shared memory and tasks. The HSA is being developed by the HSA Foundation, which includes AMD and ARM. The platform's stated aim is to reduce communication latency between CPUs, GPUs and other compute devices, and make these various devices more compatible from a programmer's perspective, relieving the programmer of the task of planning the moving of data between devices' disjoint memories.

OpenVX is an open, royalty-free standard for cross-platform acceleration of computer vision applications. It is designed by the Khronos Group to facilitate portable, optimized and power-efficient processing of methods for vision algorithms. This is aimed for embedded and real-time programs within computer vision and related scenarios. It uses a connected graph representation of operations.

Vulkan is a low-level, low-overhead cross-platform API and open standard for 3D graphics and computing. It was intended to address the shortcomings of OpenGL, and allow developers more control over the GPU. It is designed to support a wide variety of GPUs, CPUs and operating systems, and it is also designed to work with modern multi-core CPUs.

Standard Portable Intermediate Representation

Standard Portable Intermediate Representation (SPIR) is an intermediate language for parallel computing and graphics by Khronos Group. It is used in multiple execution environments, including the Vulkan graphics API and the OpenCL compute API, to represent a shader or kernel. It is also used as an interchange language for cross compilation.

GPUOpen

GPUOpen is a middleware software suite originally developed by AMD's Radeon Technologies Group that offers advanced visual effects for computer games. It was released in 2016. GPUOpen serves as an alternative to, and a direct competitor of Nvidia GameWorks. GPUOpen is similar to GameWorks in that it encompasses several different graphics technologies as its main components that were previously independent and separate from one another. However, GPUOpen is partially open source software, unlike GameWorks which is proprietary and closed.

SYCL

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators. It is a single-source embedded domain-specific language (eDSL) based on pure C++17. It is a standard developed by Khronos Group, announced in March 2014.

ROCm

ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), heterogeneous computing. It offers several programming models: HIP, OpenMP, and OpenCL.

oneAPI (compute acceleration)

oneAPI is an open standard, adopted by Intel, for a unified application programming interface (API) intended to be used across different computing accelerator (coprocessor) architectures, including GPUs, AI accelerators and field-programmable gate arrays. It is intended to eliminate the need for developers to maintain separate code bases, multiple programming languages, tools, and workflows for each architecture.

References

  1. "The OpenCL Specification".
  2. "Android Devices With OpenCL support". Google Docs. ArrayFire. Retrieved April 28, 2015.
  3. "FreeBSD Graphics/OpenCL". FreeBSD. Retrieved December 23, 2015.
  4. "Conformant Products". Khronos Group. Retrieved May 9, 2015.
  5. Sochacki, Bartosz (July 19, 2019). "The OpenCL C++ 1.0 Specification" (PDF). Khronos OpenCL Working Group. Retrieved July 19, 2019.
  6. Munshi, Aaftab; Howes, Lee; Sochaki, Barosz (April 27, 2020). "The OpenCL C Specification Version: 3.0 Document Revision: V3.0.7" (PDF). Khronos OpenCL Working Group. Archived from the original (PDF) on September 20, 2020. Retrieved April 28, 2021.
  7. "The C++ for OpenCL 1.0 and 2021 Programming Language Documentation". Khronos OpenCL Working Group. December 20, 2021. Retrieved December 2, 2022.
  8. "Conformant Companies". Khronos Group. Retrieved September 19, 2024.
  9. Gianelli, Silvia E. (January 14, 2015). "Xilinx SDAccel Development Environment for OpenCL, C, and C++, Achieves Khronos Conformance". PR Newswire. Xilinx. Retrieved April 27, 2015.
  10. Howes, Lee (November 11, 2015). "The OpenCL Specification Version: 2.1 Document Revision: 23" (PDF). Khronos OpenCL Working Group. Retrieved November 16, 2015.
  11. Gaster, Benedict; Howes, Lee; Kaeli, David R.; Mistry, Perhaad; Schaa, Dana (2012). Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition. Morgan Kaufmann.
  12. Tompson, Jonathan; Schlachter, Kristofer (2012). "An Introduction to the OpenCL Programming Model" (PDF). New York University Media Research Lab. Archived from the original (PDF) on July 6, 2015. Retrieved July 6, 2015.
  13. Stone, John E.; Gohara, David; Shi, Guochin (2010). "OpenCL: a parallel programming standard for heterogeneous computing systems". Computing in Science & Engineering. 12 (3): 66–73. Bibcode:2010CSE....12c..66S. doi:10.1109/MCSE.2010.69. PMC 2964860. PMID 21037981.
  14. Klöckner, Andreas; Pinto, Nicolas; Lee, Yunsup; Catanzaro, Bryan; Ivanov, Paul; Fasih, Ahmed (2012). "PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation". Parallel Computing. 38 (3): 157–174. arXiv: 0911.3456 . doi:10.1016/j.parco.2011.09.001. S2CID   18928397.
  15. "OpenCL - Open Computing Language Bindings". metacpan.org. Retrieved August 18, 2018.
  16. "D binding for OpenCL". dlang.org. Retrieved June 29, 2021.
  17. "SPIR – The first open standard intermediate language for parallel compute and graphics". Khronos Group. January 21, 2014.
  18. "SYCL – C++ Single-source Heterogeneous Programming for OpenCL". Khronos Group. January 21, 2014. Archived from the original on January 18, 2021. Retrieved October 24, 2016.
  19. "C++ for OpenCL, OpenCL-Guide". GitHub. Retrieved April 18, 2021.
  20. Munshi, Aaftab, ed. (2014). "The OpenCL C Specification, Version 2.0" (PDF). Retrieved June 24, 2014.
  21. "Introduction to OpenCL Programming 201005" (PDF). AMD. pp. 89–90. Archived from the original (PDF) on May 16, 2011. Retrieved August 8, 2017.
  22. "OpenCL" (PDF). SIGGRAPH2008. August 14, 2008. Archived from the original (PDF) on February 16, 2012. Retrieved August 14, 2008.
  23. "Fitting FFT onto G80 Architecture" (PDF). Vasily Volkov and Brian Kazian, UC Berkeley CS258 project report. May 2008. Retrieved November 14, 2008.
  24. "OpenCL_FFT". Apple. June 26, 2012. Retrieved June 18, 2022.
  25. Trevett, Neil (April 28, 2020). "Khronos Announcements and Panel Discussion" (PDF).
  26. Stulova, Anastasia; Hickey, Neil; van Haastregt, Sven; Antognini, Marco; Petit, Kevin (April 27, 2020). "The C++ for OpenCL Programming Language". Proceedings of the International Workshop on OpenCL. IWOCL '20. Munich, Germany: Association for Computing Machinery. pp. 1–2. doi:10.1145/3388333.3388647. ISBN   978-1-4503-7531-3. S2CID   216554183.
  27. KhronosGroup/OpenCL-Docs, The Khronos Group, April 16, 2021, retrieved April 18, 2021
  28. "Clang release 9 documentation, OpenCL support". releases.llvm.org. September 2019. Retrieved April 18, 2021.
  29. "Clang 9, Language Extensions, OpenCL". releases.llvm.org. September 2019. Retrieved April 18, 2021.
  30. "Release of Documentation of C++ for OpenCL kernel language, version 1.0, revision 1 · KhronosGroup/OpenCL-Docs". GitHub. December 2020. Retrieved April 18, 2021.
  31. "Release of Documentation of C++ for OpenCL kernel language, version 1.0 and 2021 · KhronosGroup/OpenCL-Docs". GitHub. December 2021. Retrieved December 2, 2022.
  32. "The C++ for OpenCL 1.0 Programming Language Documentation". www.khronos.org. Retrieved April 18, 2021.
  33. "Release of C++ for OpenCL Kernel Language Documentation, version 1.0, revision 2 · KhronosGroup/OpenCL-Docs". GitHub. March 2021. Retrieved April 18, 2021.
  34. "cl_ext_cxx_for_opencl". www.khronos.org. September 2020. Retrieved April 18, 2021.
  35. "Mali SDK Supporting Compilation of Kernels in C++ for OpenCL". community.arm.com. December 2020. Retrieved April 18, 2021.
  36. "Clang Compiler User's Manual — C++ for OpenCL Support". clang.llvm.org. Retrieved April 18, 2021.
  37. "OpenCL-Guide, Offline Compilation of OpenCL Kernel Sources". GitHub. Retrieved April 18, 2021.
  38. "OpenCL-Guide, Programming OpenCL Kernels". GitHub. Retrieved April 18, 2021.
  39. Clspv is a prototype compiler for a subset of OpenCL C to Vulkan compute shaders: google/clspv, August 17, 2019, retrieved August 20, 2019
  40. Petit, Kévin (April 17, 2021), Experimental implementation of OpenCL on Vulkan , retrieved April 18, 2021
  41. "Khronos Launches Heterogeneous Computing Initiative" (Press release). Khronos Group. June 16, 2008. Archived from the original on June 20, 2008. Retrieved June 18, 2008.
  42. "OpenCL gets touted in Texas". MacWorld. November 20, 2008. Retrieved June 12, 2009.
  43. "The Khronos Group Releases OpenCL 1.0 Specification" (Press release). Khronos Group. December 8, 2008. Retrieved December 4, 2016.
  44. "Apple Previews Mac OS X Snow Leopard to Developers" (Press release). Apple Inc. June 9, 2008. Archived from the original on March 18, 2012. Retrieved June 9, 2008.
  45. "AMD Drives Adoption of Industry Standards in GPGPU Software Development" (Press release). AMD. August 6, 2008. Retrieved August 14, 2008.
  46. "AMD Backs OpenCL, Microsoft DirectX 11". eWeek. August 6, 2008. Archived from the original on December 6, 2012. Retrieved August 14, 2008.
  47. "HPCWire: RapidMind Embraces Open Source and Standards Projects". HPCWire. November 10, 2008. Archived from the original on December 18, 2008. Retrieved November 11, 2008.
  48. "Nvidia Adds OpenCL To Its Industry Leading GPU Computing Toolkit" (Press release). Nvidia. December 9, 2008. Retrieved December 10, 2008.
  49. "OpenCL Development Kit for Linux on Power". alphaWorks. October 30, 2009. Archived from the original on August 9, 2011. Retrieved October 30, 2009.
  50. "Opencl Standard – an overview | ScienceDirect Topics". www.sciencedirect.com.
  51. "The OpenCL Specification Version: 1.0 Document Revision: 48" (PDF). Khronos OpenCL Working Group.
  52. "Khronos Drives Momentum of Parallel Computing Standard with Release of OpenCL 1.1 Specification". Archived from the original on March 2, 2016. Retrieved February 24, 2016.
  53. "Khronos Releases OpenCL 1.2 Specification". Khronos Group. November 15, 2011. Retrieved June 23, 2015.
  54. "OpenCL 1.2 Specification" (PDF). Khronos Group. Retrieved June 23, 2015.
  55. "Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing". Khronos Group. November 18, 2013. Retrieved February 10, 2014.
  56. "Khronos Releases OpenCL 2.1 and SPIR-V 1.0 Specifications for Heterogeneous Parallel Programming". Khronos Group. November 16, 2015. Retrieved November 16, 2015.
  57. "Khronos Announces OpenCL 2.1: C++ Comes to OpenCL". AnandTech. March 3, 2015. Retrieved April 8, 2015.
  58. "Khronos Releases OpenCL 2.1 Provisional Specification for Public Review". Khronos Group. March 3, 2015. Retrieved April 8, 2015.
  59. "OpenCL Overview". Khronos Group. July 21, 2013.
  60. "Khronos Releases OpenCL 2.2 Provisional Specification with OpenCL C++ Kernel Language for Parallel Programming". Khronos Group. April 18, 2016.
  61. Trevett, Neil (April 2016). "OpenCL – A State of the Union" (PDF). IWOCL. Vienna: Khronos Group . Retrieved January 2, 2017.
  62. "Khronos Releases OpenCL 2.2 With SPIR-V 1.2". Khronos Group. May 16, 2017.
  63. "OpenCL 2.2 Maintenance Update Released". The Khronos Group. May 14, 2018.
  64. "OpenCL 3.0 Bringing Greater Flexibility, Async DMA Extensions". www.phoronix.com.
  65. "Khronos Group Releases OpenCL 3.0". April 26, 2020.
  66. "The OpenCL Specification" (PDF). Khronos OpenCL Working Group.
  67. Trevett, Neil (2021). "State of the Union: OpenCL Working Group" (PDF). p. 9.
  68. "Using Semaphore and Memory Sharing Extensions for Vulkan Interop with NVIDIA OpenCL". February 24, 2022.
  69. "OpenCL 3.0.14 Released with New Extension for Command Buffer Multi-Device".
  70. "Breaking: OpenCL Merging Roadmap into Vulkan | PC Perspective". www.pcper.com. Archived from the original on November 1, 2017. Retrieved May 17, 2017.
  71. "SIGGRAPH 2018: OpenCL-Next Taking Shape, Vulkan Continues Evolving – Phoronix". www.phoronix.com.
  72. "Vulkan Update SIGGRAPH 2019" (PDF).
  73. Trevett, Neil (May 23, 2019). "Khronos and OpenCL Overview EVS Workshop May19" (PDF). Khronos Group.
  74. "OpenCL ICD Specification" . Retrieved June 23, 2015.
  75. "Apple entry on LLVM Users page" . Retrieved August 29, 2009.
  76. "Nvidia entry on LLVM Users page" . Retrieved August 6, 2009.
  77. "Rapidmind entry on LLVM Users page" . Retrieved October 1, 2009.
  78. "Zack Rusin's blog post about the Gallium3D OpenCL implementation". February 2009. Retrieved October 1, 2009.
  79. "GalliumCompute". dri.freedesktop.org. Retrieved June 23, 2015.
  80. "Clover Status Update" (PDF).
  81. "mesa/mesa – The Mesa 3D Graphics Library". cgit.freedesktop.org.
  82. "Gallium Clover With SPIR-V & NIR Opening Up New Compute Options Inside Mesa – Phoronix". www.phoronix.com. Archived from the original on October 22, 2020. Retrieved December 13, 2018.
  83. Clark, Rob; Herbst, Karol (2018). "OpenCL support inside mesa through SPIR-V and NIR" (PDF).
  84. "Mesa's 'Rusticl' Implementation Now Manages to Handle Darktable OpenCL".
  85. "Mesa's Rusticl Achieves Official OpenCL 3.0 Conformance".
  86. "Mesa 22.3 Released with RDNA3 Vulkan, Rusticl OpenCL, Better Intel Arc Graphics".
  87. "Mesa's Rusticl OpenCL Driver Nearly Ready with AMD Radeon GPU Support".
  88. "Mesa's Rusticl OpenCL Implementation Can Outperform Radeon's ROCm Compute Stack".
  89. "State of Windows on Arm64: a high-level perspective". Chips and Cheese. March 13, 2022. Retrieved October 23, 2023.
  90. "Introducing OpenCL and OpenGL on DirectX". Collabora | Open Source Consulting. Retrieved October 23, 2023.
  91. "Deep dive into OpenGL over DirectX layering". Collabora | Open Source Consulting. Retrieved October 23, 2023.
  92. Larabel, Michael (January 10, 2013). "Beignet: OpenCL/GPGPU Comes For Ivy Bridge On Linux". Phoronix.
  93. Larabel, Michael (April 16, 2013). "More Criticism Comes Towards Intel's Beignet OpenCL". Phoronix.
  94. Larabel, Michael (December 24, 2013). "Intel's Beignet OpenCL Is Still Slowly Baking". Phoronix.
  95. "Beignet". freedesktop.org.
  96. "beignet – Beignet OpenCL Library for Intel Ivy Bridge and newer GPUs". cgit.freedesktop.org.
  97. "Intel Brings Beignet To Android For OpenCL Compute – Phoronix". www.phoronix.com.
  98. "01.org Intel Open Source – Compute Runtime". February 7, 2018.
  99. "NEO GitHub README". GitHub . March 21, 2019.
  100. "ROCm". GitHub. Archived from the original on October 8, 2016.
  101. "RadeonOpenCompute/ROCm: ROCm – Open Source Platform for HPC and Ultrascale GPU Computing". GitHub. March 21, 2019.
  102. "A Nice Overview Of The ROCm Linux Compute Stack – Phoronix". www.phoronix.com.
  103. "XDC Lightning.pdf". Google Docs.
  104. "Radeon ROCm 2.0 Officially Out With OpenCL 2.0 Support, TensorFlow 1.12, Vega 48-bit VA – Phoronix". www.phoronix.com.
  105. "Taking Radeon ROCm 2.0 OpenCL For A Benchmarking Test Drive – Phoronix". www.phoronix.com.
  106. https://github.com/RadeonOpenCompute/ROCm/blob/master/AMD_ROCm_Release_Notes_v3.3.pdf
  107. "Radeon ROCm 3.5 Released with New Features but Still No Navi Support – Phoronix".
  108. "Radeon ROCm 3.10 Released with Data Center Tool Improvements, New APIs – Phoronix".
  109. "AMD Launches Arcturus as the Instinct MI100, Radeon ROCm 4.0 – Phoronix".
  110. "Welcome to AMD ROCm™ Platform — ROCm Documentation 1.0.0 documentation".
  111. "Home". docs.amd.com.
  112. "AMD Documentation – Portal".
  113. Jääskeläinen, Pekka; Sánchez de La Lama, Carlos; Schnetter, Erik; Raiskila, Kalle; Takala, Jarmo; Berg, Heikki (2016). "pocl: A Performance-Portable OpenCL Implementation". Int'l J. Parallel Programming. 43 (5): 752–785. arXiv: 1611.07083 . Bibcode:2016arXiv161107083J. doi:10.1007/s10766-014-0320-y. S2CID   9905244.
  114. "pocl home page". pocl.
  115. "GitHub – pocl/pocl: pocl: Portable Computing Language". March 14, 2019 via GitHub.
  116. "HSA support implementation status as of 2016-05-17 — Portable Computing Language (pocl) 1.3-pre documentation". portablecl.org.
  117. "PoCL home page".
  118. "PoCL home page".
  119. "PoCL home page".
  120. "POCL 1.6-RC1 Released with Better CUDA Performance – Phoronix". Archived from the original on January 17, 2021. Retrieved December 3, 2020.
  121. Baumann, Tobias; Noack, Matthias; Steinke, Thomas (2021). "Performance Evaluation and Improvements of the PoCL Open-Source OpenCL Implementation on Intel CPUs" (PDF). p. 51.
  122. "PoCL home page".
  123. "PoCL home page".
  124. "PoCL home page".
  125. "PoCL home page".
  126. "About". Git.Linaro.org.
  127. Gall, T.; Pitney, G. (March 6, 2014). "LCA14-412: GPGPU on ARM SoC" (PDF). Amazon Web Services . Archived from the original (PDF) on July 26, 2020. Retrieved January 22, 2017.
  128. "zuzuf/freeocl". GitHub. Retrieved April 13, 2017.
  129. Zhang, Peng; Fang, Jianbin; Yang, Canqun; Tang, Tao; Huang, Chun; Wang, Zheng (2018). MOCL: An Efficient OpenCL Implementation for the Matrix-2000 Architecture (PDF). Proc. Int'l Conf. on Computing Frontiers. doi:10.1145/3203217.3203244.
  130. "Status". GitHub. March 16, 2022.
  131. "OpenCL Demo, AMD CPU". YouTube. December 10, 2008. Retrieved March 28, 2009.
  132. "OpenCL Demo, Nvidia GPU". YouTube. December 10, 2008. Retrieved March 28, 2009.
  133. "Imagination Technologies launches advanced, highly-efficient POWERVR SGX543MP multi-processor graphics IP family". Imagination Technologies. March 19, 2009. Archived from the original on April 3, 2014. Retrieved January 30, 2011.
  134. "AMD and Havok demo OpenCL accelerated physics". PC Perspective. March 26, 2009. Archived from the original on April 5, 2009. Retrieved March 28, 2009.
  135. "Nvidia Releases OpenCL Driver To Developers". Nvidia. April 20, 2009. Archived from the original on February 4, 2012. Retrieved April 27, 2009.
  136. "AMD does reverse GPGPU, announces OpenCL SDK for x86". Ars Technica. August 5, 2009. Retrieved August 6, 2009. [permanent dead link]
  137. Moren, Dan; Snell, Jason (June 8, 2009). "Live Update: WWDC 2009 Keynote". MacWorld.com. MacWorld. Retrieved June 12, 2009.
  138. "ATI Stream Software Development Kit (SDK) v2.0 Beta Program". Archived from the original on August 9, 2009. Retrieved October 14, 2009.
  139. "S3 Graphics launched the Chrome 5400E embedded graphics processor". Archived from the original on December 2, 2009. Retrieved October 27, 2009.
  140. "VIA Brings Enhanced VN1000 Graphics Processor". Archived from the original on December 15, 2009. Retrieved December 10, 2009.
  141. "ATI Stream SDK v2.0 with OpenCL 1.0 Support". Archived from the original on November 1, 2009. Retrieved October 23, 2009.
  142. "OpenCL". ZiiLABS. Retrieved June 23, 2015.
  143. "Intel discloses new Sandy Bridge technical details". Archived from the original on October 31, 2013. Retrieved September 13, 2010.
  144. http://reference.wolfram.com/mathematica/OpenCLLink/tutorial/Overview.html [bare URL]
  145. "WebCL related stories". Khronos Group. Retrieved June 23, 2015.
  146. "Khronos Releases Final WebGL 1.0 Specification". Khronos Group. Archived from the original on July 9, 2015. Retrieved June 23, 2015.
  147. "IBM Developer". developer.ibm.com.
  148. "Welcome to Wikis". www.ibm.com. October 20, 2009.
  149. "Nokia Research releases WebCL prototype". Khronos Group. May 4, 2011. Archived from the original on December 5, 2020. Retrieved June 23, 2015.
  150. KamathK, Sharath. "Samsung's WebCL Prototype for WebKit". GitHub.com. Archived from the original on February 18, 2015. Retrieved June 23, 2015.
  151. "AMD Opens the Throttle on APU Performance with Updated OpenCL Software Development". Amd.com. August 8, 2011. Retrieved June 16, 2013.
  152. "AMD APP SDK v2.6". Forums.amd.com. March 13, 2015. Retrieved June 23, 2015. [dead link]
  153. "The Portland Group Announces OpenCL Compiler for ST-Ericsson ARM-Based NovaThor SoCs". Retrieved May 4, 2012.
  154. "WebCL Latest Spec". Khronos Group. November 7, 2013. Archived from the original on August 1, 2014. Retrieved June 23, 2015.
  155. "Altera Opens the World of FPGAs to Software Programmers with Broad Availability of SDK and Off-the-Shelf Boards for OpenCL". Altera.com. Archived from the original on January 9, 2014. Retrieved January 9, 2014.
  156. "Altera SDK for OpenCL is First in Industry to Achieve Khronos Conformance for FPGAs". Altera.com. Archived from the original on January 9, 2014. Retrieved January 9, 2014.
  157. "Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing". Khronos Group. November 18, 2013. Retrieved June 23, 2015.
  158. "WebCL 1.0 Press Release". Khronos Group. March 19, 2014. Retrieved June 23, 2015.
  159. "WebCL 1.0 Specification". Khronos Group. March 14, 2014. Retrieved June 23, 2015.
  160. "Intel OpenCL 2.0 Driver". Archived from the original on September 17, 2014. Retrieved October 14, 2014.
  161. "AMD OpenCL 2.0 Driver". Support.AMD.com. June 17, 2015. Retrieved June 23, 2015.
  162. "Xilinx SDAccel development environment for OpenCL, C, and C++, achieves Khronos Conformance – khronos.org news". The Khronos Group. Retrieved June 26, 2017.
  163. "Release 349 Graphics Drivers for Windows, Version 350.12" (PDF). April 13, 2015. Retrieved February 4, 2016.
  164. "AMD APP SDK 3.0 Released". Developer.AMD.com. August 26, 2015. Retrieved September 11, 2015.
  165. "Khronos Releases OpenCL 2.1 and SPIR-V 1.0 Specifications for Heterogeneous Parallel Programming". Khronos Group. November 16, 2015.
  166. "What's new? Intel® SDK for OpenCL™ Applications 2016, R3". Intel Software.
  167. "NVIDIA 378.66 drivers for Windows offer OpenCL 2.0 evaluation support". Khronos Group. February 17, 2017. Archived from the original on August 6, 2020. Retrieved March 17, 2017.
  168. Szuppe, Jakub (February 22, 2017). "NVIDIA enables OpenCL 2.0 beta-support".
  169. Szuppe, Jakub (March 6, 2017). "NVIDIA beta-support for OpenCL 2.0 works on Linux too".
  170. "The Khronos Group". The Khronos Group. March 21, 2019.
  171. "GitHub – RadeonOpenCompute/ROCm at roc-3.5.0". GitHub.
  172. "NVIDIA is Now OpenCL 3.0 Conformant". April 12, 2021.
  173. "The Khronos Group". The Khronos Group. December 12, 2022. Retrieved December 12, 2022.
  174. "Mesa's Rusticl Achieves Official OpenCL 3.0 Conformance". www.phoronix.com. Retrieved December 12, 2022.
  175. "The Khronos Group". The Khronos Group. August 20, 2019. Retrieved August 20, 2019.
  176. "KhronosGroup/OpenCL-CTS: The OpenCL Conformance Tests". GitHub. March 21, 2019.
  177. "OpenCL and the AMD APP SDK". AMD Developer Central. developer.amd.com. Archived from the original on August 4, 2011. Retrieved August 11, 2011.
  178. "About Intel OpenCL SDK 1.1". software.intel.com. intel.com. Retrieved August 11, 2011.
  179. "Intel® SDK for OpenCL™ Applications – Release Notes". software.intel.com. March 14, 2019.
  180. "Product Support". Retrieved August 11, 2011.
  181. "Intel OpenCL SDK – Release Notes". Archived from the original on July 17, 2011. Retrieved August 11, 2011.
  182. "Announcing OpenCL Development Kit for Linux on Power v0.3". IBM. Retrieved August 11, 2011.
  183. "IBM releases OpenCL Development Kit for Linux on Power v0.3 – OpenCL 1.1 conformant release available". OpenCL Lounge. ibm.com. Retrieved August 11, 2011.
  184. "IBM releases OpenCL Common Runtime for Linux on x86 Architecture". IBM. October 20, 2009. Retrieved September 10, 2011.
  185. "OpenCL and the AMD APP SDK". AMD Developer Central. developer.amd.com. Archived from the original on September 6, 2011. Retrieved September 10, 2011.
  186. "Nvidia Releases OpenCL Driver". April 22, 2009. Retrieved August 11, 2011.
  187. "clinfo by Simon Leblanc". GitHub. Retrieved January 27, 2017.
  188. "clinfo by Oblomov". GitHub. Retrieved January 27, 2017.
  189. "clinfo: openCL INFOrmation". April 2, 2013. Retrieved January 27, 2017.
  190. "Khronos Products". The Khronos Group. Retrieved May 15, 2017.
  191. "OpenCL-CTS/Test_conformance at main · KhronosGroup/OpenCL-CTS". GitHub.
  192. "Issues · KhronosGroup/OpenCL-CTS". GitHub.
  193. "Intel Compute-Runtime 20.43.18277 Brings Alder Lake Support".
  194. "compute-runtime". 01.org. February 7, 2018.
  195. Fang, Jianbin; Varbanescu, Ana Lucia; Sips, Henk (2011). "A Comprehensive Performance Comparison of CUDA and OpenCL". 2011 International Conference on Parallel Processing. pp. 216–225. doi:10.1109/ICPP.2011.45. ISBN 978-1-4577-1336-1.
  196. Du, Peng; Weber, Rick; Luszczek, Piotr; Tomov, Stanimire; Peterson, Gregory; Dongarra, Jack (2012). "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming". Parallel Computing. 38 (8): 391–407. CiteSeerX 10.1.1.193.7712. doi:10.1016/j.parco.2011.10.002.
  197. Dolbeau, Romain; Bodin, François; de Verdière, Guillaume Colin (September 7, 2013). "One OpenCL to rule them all?". 2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS). pp. 1–6. doi:10.1109/MuCoCoS.2013.6633603. ISBN 978-1-4799-1010-6. S2CID 225784.
  198. Karimi, Kamran; Dickson, Neil G.; Hamze, Firas (2011). "A Performance Comparison of CUDA and OpenCL". arXiv:1005.2581v3 [cs.PF].
  199. Mittal, Sparsh; Vetter, Jeffrey S. (2015). "A Survey of CPU-GPU Heterogeneous Computing Techniques". ACM Computing Surveys.
  200. Grewe, Dominik; O'Boyle, Michael F. P. (2011). "A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL". Compiler Construction. Lecture Notes in Computer Science. Vol. 6601. pp. 286–305. doi:10.1007/978-3-642-19861-8_16. ISBN 978-3-642-19860-1.
  201. "Radeon RX 6800 Series Has Excellent ROCm-Based OpenCL Performance On Linux". www.phoronix.com.