CuPy

CuPy
Original author(s)	Seiya Tokui
Developer(s)	Community, Preferred Networks, Inc.
Initial release	September 2, 2015
Stable release	v10.5.0 / May 26, 2022
Preview release	v11.0.0b3 / May 26, 2022
Repository	github.com/cupy/cupy
Written in	Python, Cython, CUDA
Operating system	Linux, Windows
Platform	Cross-platform
Type	Numerical analysis
License	MIT
Website	cupy.dev

Last updated April 30, 2023

CuPy is an open source library for GPU-accelerated computing with the Python programming language, providing support for multi-dimensional arrays, sparse matrices, and a variety of numerical algorithms implemented on top of them.^[3] CuPy shares the same API set as NumPy and SciPy, allowing it to serve as a drop-in replacement for running NumPy/SciPy code on a GPU. CuPy supports the NVIDIA CUDA GPU platform and, starting in v9.0, the AMD ROCm GPU platform.^[4]^[5]

Features

CuPy implements NumPy/SciPy-compatible APIs, as well as features to write user-defined GPU kernels or access low-level APIs.^[14]^[15]

NumPy-compatible APIs

The same set of APIs defined in the NumPy package (numpy.*) is available under the cupy.* package.

Multi-dimensional array (cupy.ndarray) for boolean, integer, float, and complex data types
Module-level functions
Linear algebra functions
Fast Fourier transform
Random number generator
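
As a rough illustration of the list above, the following sketch writes code once against an array module named xp and binds that name to CuPy when it is usable. The fallback to NumPy and the to_host helper are assumptions added here so the snippet also runs on machines without a GPU; they are not part of CuPy itself.

```python
import numpy as np

# Assumption: use CuPy when it imports and a GPU is usable, otherwise NumPy.
try:
    import cupy as xp
    xp.zeros(1)  # touches the device, so a missing GPU or driver fails here
except Exception:
    xp = np


def to_host(a):
    """Return a NumPy array whichever backend produced `a` (hypothetical helper)."""
    return a.get() if hasattr(a, "get") else np.asarray(a)


a = xp.arange(6, dtype=xp.float64).reshape(2, 3)   # multi-dimensional array
gram = a @ a.T                                     # linear algebra
spectrum = xp.fft.fft(xp.ones(4))                  # fast Fourier transform
samples = xp.random.rand(3)                        # random number generation
total = xp.sum(a)                                  # module-level function

print(to_host(gram))
```

Because only the binding of xp changes, the same statements exercise either numpy.* or cupy.*, which is the sense in which the two packages share an API.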

SciPy-compatible APIs

The same set of APIs defined in the SciPy package (scipy.*) is available under the cupyx.scipy.* package.

Sparse matrices (cupyx.scipy.sparse.*_matrix) in CSR, COO, CSC, and DIA formats
Discrete Fourier transform
Advanced linear algebra
Multidimensional image processing
Sparse linear algebra
Special functions
Signal processing
Statistical functions
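
The sparse matrix support listed above can be sketched as follows. The snippet builds a CSR matrix from a small dense array and multiplies it by a vector; falling back to scipy.sparse when CuPy or a GPU is unavailable is an assumption made here for portability, as is the to_host helper.

```python
import numpy as np

# Assumption: prefer cupyx.scipy.sparse, fall back to scipy.sparse without a GPU.
try:
    import cupy as xp
    import cupyx.scipy.sparse as sparse
    xp.zeros(1)  # fails here if no usable GPU is present
except Exception:
    import scipy.sparse as sparse
    xp = np


def to_host(a):
    """Return a NumPy array whichever backend produced `a` (hypothetical helper)."""
    return a.get() if hasattr(a, "get") else np.asarray(a)


dense = xp.asarray([[1.0, 0.0, 2.0],
                    [0.0, 0.0, 3.0],
                    [4.0, 0.0, 0.0]])
m = sparse.csr_matrix(dense)   # compressed sparse row storage
v = xp.ones(3)
y = m.dot(v)                   # sparse matrix-vector product

print(m.nnz, to_host(y))
```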

User-defined GPU kernels

Kernel templates for element-wise and reduction operations
Raw kernel (CUDA C/C++)
Just-in-time transpiler (JIT)
Kernel fusion
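
A minimal sketch of the element-wise and reduction kernel templates, following the pattern shown in the CuPy documentation. The wrapper function, its name, and the NumPy fallback branch are assumptions added so the example can also be run, and checked, on a machine without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails here if no usable GPU is present
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False


def squared_diff_and_norm(x, y):
    """Return ((x - y)**2, ||x||_2); hypothetical wrapper for illustration."""
    x = np.asarray(x, dtype=np.float32)
    y = np.asarray(y, dtype=np.float32)
    if not HAVE_GPU:
        # Reference computation with NumPy (assumed fallback).
        return (x - y) ** 2, float(np.sqrt(np.sum(x * x)))

    # Element-wise template: input params, output params, loop body, kernel name.
    squared_diff = cp.ElementwiseKernel(
        'float32 x, float32 y', 'float32 z',
        'z = (x - y) * (x - y)', 'squared_diff')
    # Reduction template: map, reduce, post-reduction map, identity, kernel name.
    l2norm = cp.ReductionKernel(
        'T x', 'T y', 'x * x', 'a + b', 'y = sqrt(a)', '0', 'l2norm')

    d_x, d_y = cp.asarray(x), cp.asarray(y)
    return cp.asnumpy(squared_diff(d_x, d_y)), float(l2norm(d_x))


diff, norm = squared_diff_and_norm([0, 1, 2, 3, 4], [2, 2, 2, 2, 2])
print(diff, norm)
```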

Distributed computing

Distributed communication package (cupyx.distributed), providing collective and peer-to-peer primitives

Low-level CUDA features

Stream and event
Memory pool
Profiler
Host API binding
CUDA Python support^[16]
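
The stream, event, and memory pool features can be combined roughly as below. This is a sketch under the assumption that a CUDA-capable GPU is present; the timed_sum function and the NumPy branch are hypothetical additions that keep the snippet runnable elsewhere.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails here if no usable GPU is present
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False


def timed_sum(n):
    """Sum 0..n-1 and, on a GPU, report timing and pool usage (hypothetical helper)."""
    if not HAVE_GPU:
        return float(np.arange(n, dtype=np.float64).sum()), None

    stream = cp.cuda.Stream(non_blocking=True)
    start, stop = cp.cuda.Event(), cp.cuda.Event()
    with stream:                      # work below is queued on this stream
        start.record(stream)
        total = cp.arange(n, dtype=cp.float64).sum()
        stop.record(stream)
    stop.synchronize()                # wait until the recorded work has finished

    pool = cp.get_default_memory_pool()
    info = {
        "elapsed_ms": cp.cuda.get_elapsed_time(start, stop),
        "pool_used_bytes": pool.used_bytes(),
        "pool_total_bytes": pool.total_bytes(),
    }
    return float(total), info


value, info = timed_sum(1000)
print(value, info)
```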

Interoperability

DLPack^[17]
CUDA Array Interface^[18]
NEP 13 (__array_ufunc__)^[19]
NEP 18 (__array_function__)^[20]^[21]
Array API Standard^[22]^[23]
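
To make the protocols above more concrete, the sketch below inspects which interoperability hooks an array exposes and, when a GPU is available, shows NumPy functions dispatching to CuPy. The supported_protocols helper is hypothetical, and the use of cupy.from_dlpack assumes a CuPy release that provides it (v10 or later, to the best of my knowledge).

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails here if no usable GPU is present
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False


def supported_protocols(a):
    """Report which interoperability hooks an array object exposes (hypothetical helper)."""
    names = ("__dlpack__", "__cuda_array_interface__",
             "__array_ufunc__", "__array_function__")
    return {name: hasattr(a, name) for name in names}


host_report = supported_protocols(np.arange(4))
print(host_report)

if HAVE_GPU:
    x = cp.arange(4)
    print(supported_protocols(x))
    doubled = np.add(x, x)       # NEP 13: the NumPy ufunc dispatches to CuPy
    total = np.sum(x)            # NEP 18: the NumPy function dispatches to CuPy
    shared = cp.from_dlpack(x)   # DLPack: exchange without copying the buffer
    assert isinstance(doubled, cp.ndarray) and isinstance(total, cp.ndarray)
```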

Examples

Array creation

>>> import cupy as cp
>>> x = cp.array([1, 2, 3])
>>> x
array([1, 2, 3])
>>> y = cp.arange(10)
>>> y
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Basic operations

>>> import cupy as cp
>>> x = cp.arange(12).reshape(3, 4).astype(cp.float32)
>>> x
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]], dtype=float32)
>>> x.sum(axis=1)
array([ 6., 22., 38.], dtype=float32)

Raw CUDA C/C++ kernel

>>> import cupy as cp
>>> kern = cp.RawKernel(r'''
... extern "C" __global__
... void multiply_elemwise(const float* in1, const float* in2, float* out) {
...     int tid = blockDim.x * blockIdx.x + threadIdx.x;
...     out[tid] = in1[tid] * in2[tid];
... }
... ''', 'multiply_elemwise')
>>> in1 = cp.arange(16, dtype=cp.float32).reshape(4, 4)
>>> in2 = cp.arange(16, dtype=cp.float32).reshape(4, 4)
>>> out = cp.zeros((4, 4), dtype=cp.float32)
>>> kern((4,), (4,), (in1, in2, out))  # grid, block and arguments
>>> out
array([[  0.,   1.,   4.,   9.],
       [ 16.,  25.,  36.,  49.],
       [ 64.,  81., 100., 121.],
       [144., 169., 196., 225.]], dtype=float32)
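
The raw kernel example launches exactly one thread per element (4 blocks of 4 threads for 16 values). For arrays whose size is not a multiple of the block size, a common variant passes the element count and guards the write. The sketch below is one such variant, not code from the CuPy project; the kernel name multiply_elemwise_n, the multiply wrapper, and the NumPy fallback are assumptions for illustration.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails here if no usable GPU is present
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False

KERNEL_SOURCE = r'''
extern "C" __global__
void multiply_elemwise_n(const float* in1, const float* in2, float* out, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {                       // threads past the end do nothing
        out[tid] = in1[tid] * in2[tid];
    }
}
'''


def multiply(a, b, block=128):
    """Element-wise product of two equally sized arrays (hypothetical wrapper)."""
    a = np.ascontiguousarray(a, dtype=np.float32)
    b = np.ascontiguousarray(b, dtype=np.float32)
    if not HAVE_GPU or a.size == 0:
        return a * b                     # assumed CPU fallback

    kern = cp.RawKernel(KERNEL_SOURCE, 'multiply_elemwise_n')
    d_a, d_b = cp.asarray(a), cp.asarray(b)
    d_out = cp.empty_like(d_a)
    n = d_a.size
    grid = (n + block - 1) // block      # enough blocks to cover all n elements
    # Scalars are passed as NumPy types so they match the C signature (int n).
    kern((grid,), (block,), (d_a, d_b, d_out, np.int32(n)))
    return cp.asnumpy(d_out)


result = multiply(np.arange(10), np.arange(10))
print(result)
```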

Applications

References

[1] "Release v1.3.0 – chainer/chainer". Retrieved 25 June 2022, via GitHub.
[2] "Releases – cupy/cupy". Retrieved 18 June 2022, via GitHub.
[3] Okuta, Ryosuke; Unno, Yuya; Nishino, Daisuke; Hido, Shohei; Loomis, Crissman (2017). CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations (PDF). Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS).
[4] "CuPy 9.0 Brings AMD GPU Support To This Numpy-Compatible Library - Phoronix". Phoronix. 29 April 2021. Retrieved 21 June 2022.
[5] "AMD Leads High Performance Computing Towards Exascale and Beyond". 28 June 2021. Retrieved 21 June 2022. Most recently, CuPy, an open-source array library with Python, has expanded its traditional GPU support with the introduction of version 9.0 that now offers support for the ROCm stack for GPU-accelerated computing.
[6] "Preferred Networks released Version 2 of Chainer, an Open Source framework for Deep Learning - Preferred Networks, Inc". 2 June 2017. Retrieved 18 June 2022.
[7] "NumPy". numpy.org. Retrieved 21 June 2022.
[8] Gorelick, Micha; Ozsvald, Ian (April 2020). High Performance Python: Practical Performant Programming for Humans (2nd ed.). O'Reilly Media, Inc. p. 190. ISBN 9781492055020.
[9] Oak Ridge Leadership Computing Facility. "Installing CuPy". OLCF User Documentation. Retrieved 21 June 2022.
[10] National Energy Research Scientific Computing Center. "Using Python on Perlmutter". NERSC Documentation. Retrieved 21 June 2022.
[11] ETH Zurich. "CuPy". ScientificComputing. Retrieved 21 June 2022.
[12] National Institute of Advanced Industrial Science and Technology. "Chainer". ABCI 2.0 User Guide. Retrieved 21 June 2022.
[13] "Affiliated Projects - NumFOCUS". Retrieved 18 June 2022.
[14] "Overview". CuPy documentation. Retrieved 18 June 2022.
[15] "Comparison Table". CuPy documentation. Retrieved 18 June 2022.
[16] "CUDA Python | NVIDIA Developer". Retrieved 21 June 2022.
[17] "Welcome to DLPack's documentation!". DLPack 0.6.0 documentation. Retrieved 21 June 2022.
[18] "CUDA Array Interface (Version 3)". Numba 0.55.2+0.g2298ad618.dirty-py3.7-linux-x86_64.egg documentation. Retrieved 21 June 2022.
[19] "NEP 13 — A mechanism for overriding Ufuncs — NumPy Enhancement Proposals". numpy.org. Retrieved 21 June 2022.
[20] "NEP 18 — A dispatch mechanism for NumPy's high level array functions — NumPy Enhancement Proposals". numpy.org. Retrieved 21 June 2022.
[21] Charles R Harris; K. Jarrod Millman; Stéfan J. van der Walt; et al. (16 September 2020). "Array programming with NumPy" (PDF). Nature. 585 (7825): 357–362. arXiv:2006.10256. doi:10.1038/S41586-020-2649-2. ISSN 1476-4687. PMC 7759461. PMID 32939066. Wikidata Q99413970.
[22] "2021 report - Python Data APIs Consortium" (PDF). Retrieved 21 June 2022.
[23] "Purpose and scope". Python array API standard 2021.12 documentation. Retrieved 21 June 2022.
[24] "Install spaCy". spaCy Usage Documentation. Retrieved 21 June 2022.
[25] Patel, Ankur A.; Arasanipalai, Ajay Uppili (May 2021). Applied Natural Language Processing in the Enterprise (1st ed.). O'Reilly Media, Inc. p. 68. ISBN 9781492062578.
[26] "Python Package Introduction". xgboost 1.6.1 documentation. Retrieved 21 June 2022.
[27] "UCBerkeleySETI/turbo_seti: turboSETI -- python based SETI search algorithm". GitHub. Retrieved 21 June 2022.
[28] "Open GPU Data Science | RAPIDS". Retrieved 21 June 2022.
[29] "API Docs". RAPIDS Docs. Retrieved 21 June 2022.
[30] "Efficient Data Sharing between CuPy and RAPIDS". Retrieved 21 June 2022.
[31] "10 Minutes to cuDF and CuPy". Retrieved 21 June 2022.
[32] Rogozhnikov, Alex (2022). Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation. International Conference on Learning Representations.
[33] "arogozhnikov/einops: Deep learning operations reinvented (for pytorch, tensorflow, jax and others)". GitHub. Retrieved 21 June 2022.
[34] Tokui, Seiya; Okuta, Ryosuke; Akiba, Takuya; Niitani, Yusuke; Ogawa, Toru; Saito, Shunta; Suzuki, Shuji; Uenishi, Kota; Vogel, Brian; Vincent, Hiroyuki Yamazaki (2019). Chainer: A Deep Learning Framework for Accelerating the Research Cycle. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. doi:10.1145/3292500.3330756.

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
