rCUDA

Developer(s): Universitat Politecnica de Valencia
Stable release: 20.07 / July 26, 2020
Operating system: Linux
Type: GPGPU
Website: www.rcuda.net

rCUDA, which stands for Remote CUDA, is a middleware framework for remote GPU virtualization. Fully compatible with the CUDA application programming interface (API), it allows one or more CUDA-enabled GPUs to be allocated to a single application. Each GPU can be part of a cluster or run inside a virtual machine. The approach is aimed at improving performance in GPU clusters whose GPUs are not fully utilized. GPU virtualization reduces the number of GPUs needed in a cluster, which in turn lowers the cost of the configuration in terms of energy, acquisition, and maintenance.

The recommended distributed acceleration architecture is a high-performance computing cluster with GPUs attached to only a few of the cluster nodes. When a node without a local GPU runs an application that needs GPU resources, the kernel is executed remotely, supported by data and code transfers between local system memory and remote GPU memory. rCUDA is designed around this client-server architecture. On the client side, applications use a library of wrappers to the high-level CUDA Runtime API; on the server side, a network listening service receives requests on a TCP port. The client forwards each request to one of the servers, which accesses the GPU installed in that computer and executes the request on it. Several nodes running different GPU-accelerated applications can concurrently use the whole set of accelerators installed in the cluster. Time-multiplexing (sharing) a GPU is accomplished by spawning a separate server process for each remote GPU execution request. [1] [2] [3] [4] [5] [6]
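The client-server split described above can be illustrated with a toy sketch. This is not rCUDA's protocol or API: the JSON message format, the `vector_add` operation, and the names `ToyGpuHandler`, `ToyGpuServer`, `remote_vector_add`, and `start_toy_server` are all hypothetical, and the "GPU work" is plain Python arithmetic done on the server. One handler thread per connection stands in here for the per-request server processes that rCUDA spawns.

```python
import json
import socket
import socketserver
import threading


class ToyGpuHandler(socketserver.StreamRequestHandler):
    """Server side: handles one forwarded request per connection."""

    def handle(self):
        # Each request is a single JSON line (a made-up wire format).
        request = json.loads(self.rfile.readline())
        if request["op"] == "vector_add":
            # Stand-in for running a kernel on the server's local GPU.
            result = [x + y for x, y in zip(request["a"], request["b"])]
            reply = {"ok": True, "result": result}
        else:
            reply = {"ok": False, "error": "unknown op " + request["op"]}
        self.wfile.write((json.dumps(reply) + "\n").encode())


class ToyGpuServer(socketserver.ThreadingTCPServer):
    """Listening service: one handler thread per incoming connection."""

    allow_reuse_address = True
    daemon_threads = True


def remote_vector_add(address, a, b):
    """Client-side wrapper: looks like a local call, but forwards over TCP."""
    message = json.dumps({"op": "vector_add", "a": a, "b": b}) + "\n"
    with socket.create_connection(address) as conn:
        conn.sendall(message.encode())
        with conn.makefile("r") as reader:
            reply = json.loads(reader.readline())
    if not reply["ok"]:
        raise RuntimeError(reply["error"])
    return reply["result"]


def start_toy_server(host="127.0.0.1", port=0):
    """Start the listening service in the background (port 0 picks a free port)."""
    server = ToyGpuServer((host, port), ToyGpuHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_toy_server()
    try:
        print(remote_vector_add(server.server_address, [1, 2, 3], [10, 20, 30]))
    finally:
        server.shutdown()
        server.server_close()
```

The point of the sketch is the shape of the call path: the application calls an ordinary-looking function, the wrapper ships inputs to a server that owns the accelerator, and the result comes back over the same connection. A real implementation would also have to handle device memory, streams, and kernel code transfer, which this sketch ignores.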

rCUDA v20.07

The rCUDA middleware enables the concurrent usage of CUDA-compatible devices remotely.

rCUDA employs either the InfiniBand network or the socket API for the communication between clients and servers. rCUDA can be useful in three different environments:

The current version of rCUDA (v20.07) supports CUDA version 9.0, excluding graphics interoperability. rCUDA v20.07 targets the Linux OS (for 64-bit architectures) on both client and server sides.

CUDA applications do not need any changes to their source code to run with rCUDA.

References

  1. J. Prades; F. Silla (December 2019). "GPU-Job Migration: the rCUDA Case". Transactions on Parallel and Distributed Systems, vol. 30, no. 12.
  2. J. Prades; C. Reaño; F. Silla (March 2019). "On the Effect of using rCUDA to Provide CUDA Acceleration to Xen Virtual Machines". Cluster Computing, vol. 22, no. 1.
  3. F. Silla; S. Iserte; C. Reaño; J. Prades (July 2017). "On the Benefits of the Remote GPU Virtualization Mechanism: the rCUDA Case". Concurrency and Computation: Practice and Experience, vol. 29, no. 13.
  4. J. Prades; B. Varghese; C. Reaño; F. Silla (October 2017). "Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application". Journal of Parallel and Distributed Computing, vol. 108. arXiv:1606.04473.
  5. F. Pérez; C. Reaño; F. Silla (June 6–9, 2016). "Providing CUDA Acceleration to KVM Virtual Machines in InfiniBand Clusters with rCUDA". 16th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS 2016), Heraklion, Crete, Greece.
  6. S. Iserte; J. Prades; C. Reaño; F. Silla (May 16–19, 2016). "Increasing the Performance of Data Centers by Combining Remote GPU Virtualization with Slurm". 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2016), Cartagena, Colombia.