Gather/scatter (vector addressing)

Last updated

Gather/scatter is a type of memory addressing that at once collects (gathers) from, or stores (scatters) data to, multiple, arbitrary indices. Examples of its use include sparse linear algebra operations, [1] sorting algorithms, fast Fourier transforms, [2] and some computational graph theory problems. [3] It is the vector equivalent of register indirect addressing, with gather involving indexed reads, and scatter, indexed writes. Vector processors (and some SIMD units in CPUs) have hardware support for gather and scatter operations, as do many input/output systems, allowing large data sets to be transferred to main memory more rapidly.

Contents

The concept is somewhat similar to vectored I/O, which is sometimes also referred to as scatter-gather I/O. This system differs in that it is used to map multiple sources of data from contiguous structures into a single stream for reading or writing. A common example is writing out a series of strings, which in most programming languages would be stored in separate memory locations.

Definitions

Gather

A sparsely populated vector holding non-empty elements can be represented by two densely populated vectors of length ; containing the non-empty elements of , and giving the index in where 's element is located. The gather of into , denoted , assigns with having already been calculated. [4] Assuming no pointer aliasing between x[], y[],idx[], a C implementation is

for(i=0;i<N;++i)x[i]=y[idx[i]];

Scatter

The sparse scatter, denoted is the reverse operation. It copies the values of into the corresponding locations in the sparsely populated vector , i.e. .

for(i=0;i<N;++i)y[idx[i]]=x[i];

Support

Scatter/gather units were also a part of most vector computers, notably the Cray-1. In this case, the purpose was to efficiently store values in the limited resource of the vector registers. For instance, the Cray-1 had eight 64-word vector registers, so data that contained values that had no effect on the outcome, like zeros in an addition, were using up valuable space that would be better used. By gathering non-zero values into the registers, and scattering the results back out, the registers could be used much more efficiently, leading to higher performance. Such machines generally implemented two access models, scatter/gather and "stride", the latter designed to quickly load contiguous data. [5] This basic layout was widely copied in later supercomputer designs, especially on the variety of models from Japan.

As microprocessor design improved during the 1990s, commodity CPUs began to add vector processing units. At first these tended to be simple, sometimes overlaying the CPU's general purpose registers, but over time these evolved into increasingly powerful systems that met and then surpassed the units in high-end supercomputers. By this time, scatter/gather instructions had been added to many of these designs.

x86-64 CPUs which support the AVX2 instruction set can gather 32-bit and 64-bit elements with memory offsets from a base address. A second register determines whether the particular element is loaded, and faults occurring from invalid memory accesses by masked-out elements are suppressed. [6] :503–4 The AVX-512 instruction set also contains (potentially masked) scatter operations. [6] :539 [7] The ARM instruction set's Scalable Vector Extension includes gather and scatter operations on 8-, 16-, 32- and 64-bit elements. [8] [9] InfiniBand has hardware support for gather/scatter. [10]

Without instruction-level gather/scatter, efficient implementations may need to be tuned for optimal performance, for example with prefetching; libraries such as OpenMPI may provide such primitives. [2] [8]

See also

Related Research Articles

x86 Family of instruction set architectures

x86 is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors. Colloquially, their names were "186", "286", "386" and "486".

<span class="mw-page-title-main">Cray-1</span> Supercomputer manufactured by Cray Research

The Cray-1 was a supercomputer designed, manufactured and marketed by Cray Research. Announced in 1975, the first Cray-1 system was installed at Los Alamos National Laboratory in 1976. Eventually, eighty Cray-1s were sold, making it one of the most successful supercomputers in history. It is perhaps best known for its unique shape, a relatively small C-shaped cabinet with a ring of benches around the outside covering the power supplies and the cooling system.

<span class="mw-page-title-main">Single instruction, multiple data</span> Type of parallel processing

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.

<span class="mw-page-title-main">Parallel computing</span> Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

In computer architecture, predication is a feature that provides an alternative to conditional transfer of control, as implemented by conditional branch machine instructions. Predication works by having conditional (predicated) non-branch instructions associated with a predicate, a Boolean value used by the instruction to control whether the instruction is allowed to modify the architectural state or not. If the predicate specified in the instruction is true, the instruction modifies the architectural state; otherwise, the architectural state is unchanged. For example, a predicated move instruction will only modify the destination if the predicate is true. Thus, instead of using a conditional branch to select an instruction or a sequence of instructions to execute based on the predicate that controls whether the branch occurs, the instructions to be executed are associated with that predicate, so that they will be executed, or not executed, based on whether that predicate is true or false.

Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966 and extended in 1972. The classification system has stuck, and it has been used as a tool in the design of modern processors and their functionalities. Since the rise of multiprocessing central processing units (CPUs), a multiprogramming context has evolved as an extension of the classification system. Vector processing, covered by Duncan's taxonomy, is missing from Flynn's work because the Cray-1 was released in 1977: Flynn's second paper was published in 1972.

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions.

In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism where-by multiple processors cooperate in the execution of a program in order to obtain results faster.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

The Cray-3/SSS was a pioneering massively parallel supercomputer project that bonded a two-processor Cray-3 to a new SIMD processing unit based entirely in the computer's main memory. It was later considered as an add-on for the Cray T90 series in the form of the T94/SSS, but there is no evidence this was ever built.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

Advanced Vector Extensions (AVX) are SIMD extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD). They were proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions, and a new coding scheme.

The FMA instruction set is an extension to the 128 and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set to perform fused multiply–add (FMA) operations. There are two variants:

<span class="mw-page-title-main">Xeon Phi</span> Series of x86 manycore processors from Intel

Xeon Phi was a series of x86 manycore processors designed and made by Intel. It was intended for use in supercomputers, servers, and high-end workstations. Its architecture allowed use of standard programming languages and application programming interfaces (APIs) such as OpenMP.

Intel Advisor is a design assistance and analysis tool for SIMD vectorization, threading, memory use, and GPU offload optimization. The tool supports C, C++, Data Parallel C++ (DPC++), Fortran and Python languages. It is available on Windows and Linux operating systems in form of Standalone GUI tool, Microsoft Visual Studio plug-in or command line interface. It supports OpenMP. Intel Advisor user interface is also available on macOS.

AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200, and then later in a number of AMD and other Intel CPUs. AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F is required by all AVX-512 implementations.

A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of March 2010, with compute capability 2.x and higher, blocks may contain up to 1024 threads. The threads in the same thread block run on the same stream processor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations.

Permute instructions, part of bit manipulation as well as vector processing, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array. The size (bitwidth) of the source elements is not restricted but remains the same as the destination size.

References

  1. Lewis, John G.; Simon, Horst D. (1 March 1988). "The Impact of Hardware Gather/Scatter on Sparse Gaussian Elimination". SIAM Journal on Scientific and Statistical Computing. 9 (2): 304–311. doi:10.1137/0909019.
  2. 1 2 He, Bingsheng; Govindaraju, Naga K.; Luo, Qiong; Smith, Burton (2007). "Efficient gather and scatter operations on graphics processors". Proceedings of the 2007 ACM/IEEE conference on Supercomputing (PDF). pp. 1–12. doi:10.1145/1362622.1362684. ISBN   9781595937643. S2CID   2928233.
  3. Kumar, Manoj; Serrano, Mauricio; Moreira, Jose; Pattnaik, Pratap; Horn, W P; Jann, Joefon; Tanase, Gabriel (September 2016). "Efficient implementation of scatter-gather operations for large scale graph analytics". 2016 IEEE High Performance Extreme Computing Conference (HPEC). pp. 1–7. doi:10.1109/HPEC.2016.7761578. ISBN   978-1-5090-3525-0. S2CID   10566760.
  4. BLAS Technical Forum standard, Chapter 3: Sparse BLAS.
  5. Bell, Gordon (25 January 1998). A Seymour Cray Perspective (Technical report).
  6. 1 2 Kusswurm, Daniel (2022). Modern parallel programming with C++ and Assembly language : X86 SIMD development using AVX, AVX2, and AVX-512. Apress Media. ISBN   978-1-4842-7917-5.
  7. Hossain, Md Maruf; Saule, Erik (9 August 2021). "Impact of AVX-512 Instructions on Graph Partitioning Problems". 50th International Conference on Parallel Processing Workshop. pp. 1–9. doi:10.1145/3458744.3473362. ISBN   9781450384414. S2CID   237350994.
  8. 1 2 Zhong, Dong; Shamis, Pavel; Cao, Qinglei; Bosilca, George; Sumimoto, Shinji; Miura, Kenichi; Dongarra, Jack (May 2020). "Using Arm Scalable Vector Extension to Optimize OPEN MPI" (PDF). 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). pp. 222–231. doi:10.1109/CCGrid49817.2020.00-71. ISBN   978-1-7281-6095-5. S2CID   220604878.
  9. "What is the Scalable Vector Extension?". ARM Developer. Retrieved 19 November 2022.
  10. Gainaru, Ana; Graham, Richard L.; Polyakov, Artem; Shainer, Gilad (25 September 2016). "Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All". Proceedings of the 23rd European MPI Users' Group Meeting. pp. 167–179. doi:10.1145/2966884.2966918. ISBN   9781450342346. S2CID   15880901.