AoS and SoA

In computing, array of structures (AoS), structure of arrays (SoA), and array of structures of arrays (AoSoA) are contrasting ways to arrange a sequence of records in memory with regard to interleaving, and are of interest in SIMD and SIMT programming.

Structure of arrays

Structure of arrays (SoA) is a layout separating elements of a record (or 'struct' in the C programming language) into one parallel array per field.[1] The motivation is easier manipulation with packed SIMD instructions in most instruction set architectures, since a single SIMD register can load homogeneous data, possibly transferred by a wide internal datapath (e.g. 128-bit). If only a specific part of the record is needed, only those parts need to be iterated over, allowing more data to fit onto a single cache line. The downsides are that traversing multiple fields at once occupies more cache ways, and that indexed addressing becomes less efficient.

For example, to store N points in 3D space using a structure of arrays:

struct pointlist3D {
    float x[N];
    float y[N];
    float z[N];
};
struct pointlist3D points;

float get_point_x(int i) { return points.x[i]; }
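
As a rough illustration of the motivation, the following is a minimal sketch (assuming the SSE intrinsics from <xmmintrin.h>; the helper sum_x and the multiple-of-4 assumption are illustrative, not part of the original example) showing that consecutive x values load directly into a single SIMD register:

#include <xmmintrin.h>  /* SSE intrinsics */

/* Sums the first n x-coordinates; n is assumed to be a multiple of 4. */
float sum_x(const struct pointlist3D *p, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        /* Four consecutive x values fill one 128-bit register; no
           shuffling or gathering is needed because x is contiguous. */
        acc = _mm_add_ps(acc, _mm_loadu_ps(&p->x[i]));
    }
    /* Horizontal sum of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}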

Array of structures

Array of structures (AoS) is the opposite (and more conventional) layout, in which data for different fields is interleaved. This is often more intuitive, and supported directly by most programming languages.

For example, to store N points in 3D space using an array of structures:

struct point3D {
    float x;
    float y;
    float z;
};
struct point3D points[N];

float get_point_x(int i) { return points[i].x; }
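
For contrast, a minimal sketch (the helper sum_x_aos is illustrative) of why single-field traversal is less cache-friendly in this layout:

/* Summing only x strides through memory in steps of sizeof(struct
   point3D), so every cache line fetched also carries the unused y
   and z fields. */
float sum_x_aos(const struct point3D *pts, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += pts[i].x;  /* 12-byte stride; two thirds of each line unused */
    return s;
}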

Array of structures of arrays

Array of structures of arrays (AoSoA) or tiled array of structs is a hybrid approach between the previous layouts, in which data for different fields is interleaved using tiles or blocks with size equal to the SIMD vector size. This is often less intuitive, but can achieve the memory throughput of the SoA approach while being more friendly to the cache locality and load port architectures of modern processors.[2] In particular, memory requests in modern processors have to be fulfilled in a fixed width (e.g., the size of a cache line[3]). The tiled storage of AoSoA aligns the memory access pattern to this fixed width, leading to fewer access operations per memory request and thus increased efficiency.[4]

For example, to store N points in 3D space using an array of structures of arrays with a SIMD register width of 8 floats (or 8×32 = 256 bits):

struct point3Dx8 {
    float x[8];
    float y[8];
    float z[8];
};
struct point3Dx8 points[(N + 7) / 8];

float get_point_x(int i) { return points[i / 8].x[i % 8]; }

A different width may be needed depending on the actual SIMD register width. The interior arrays may be replaced with SIMD types such as float32x8 for languages with such support.
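
A minimal sketch of traversing the tiles with SIMD, assuming the AVX intrinsics from <immintrin.h> (the function translate_x is illustrative, not from the original example):

#include <immintrin.h>  /* AVX intrinsics */

/* Translates all points by dx along x. Each 8-float tile fills one
   256-bit register exactly, so the inner dimension of the AoSoA
   layout matches the SIMD width. */
void translate_x(struct point3Dx8 *pts, int ntiles, float dx)
{
    __m256 vdx = _mm256_set1_ps(dx);
    for (int t = 0; t < ntiles; t++) {
        __m256 vx = _mm256_loadu_ps(pts[t].x);
        _mm256_storeu_ps(pts[t].x, _mm256_add_ps(vx, vdx));
    }
}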

Alternatives

It is possible to split some subset of a structure (rather than each individual field) into a parallel array, and this can actually improve locality of reference if different groups of fields are used at different times in the program (see data-oriented design).
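
A minimal sketch of such a split, with hypothetical field names, keeping frequently accessed ("hot") fields separate from rarely accessed ("cold") ones:

/* Hot fields are traversed every simulation step; cold fields are
   touched only occasionally. The two arrays share the same index. */
struct particle_hot  { float x, y, z; float vx, vy, vz; };
struct particle_cold { char name[32]; int spawn_time; };

struct particle_hot  hot[N];
struct particle_cold cold[N];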

Some SIMD architectures provide strided load/store instructions to load homogeneous data from the SoA format. Yet another option used in some Cell libraries is to de-interleave data from the AoS format when loading sources into registers, and interleave when writing out results (facilitated by the superscalar issue of permutes). Some vector maths libraries align floating point 4D vectors with the SIMD register to leverage the associated data path and instructions, while still providing programmer convenience, although this does not scale to SIMD units wider than four lanes.
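
As one concrete instance of de-interleaving on load, ARM NEON offers structure-load intrinsics. A minimal sketch (the function scale_points is illustrative, and struct point3D is assumed to be tightly packed, i.e. three floats with no padding):

#include <arm_neon.h>  /* ARM NEON intrinsics */

/* vld3q_f32 reads twelve consecutive floats (x0 y0 z0 x1 y1 z1 ...)
   from the AoS layout and de-interleaves them into three homogeneous
   4-lane registers; vst3q_f32 re-interleaves on store. */
void scale_points(struct point3D *pts, int n, float s)
{
    float32x4_t vs = vdupq_n_f32(s);
    for (int i = 0; i + 4 <= n; i += 4) {
        float32x4x3_t v = vld3q_f32((const float *)&pts[i]);
        v.val[0] = vmulq_f32(v.val[0], vs);  /* x lanes */
        v.val[1] = vmulq_f32(v.val[1], vs);  /* y lanes */
        v.val[2] = vmulq_f32(v.val[2], vs);  /* z lanes */
        vst3q_f32((float *)&pts[i], v);
    }
}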

4D vectors

AoS vs. SoA presents a choice when considering 3D or 4D vector data on machines with four-lane SIMD hardware. SIMD ISAs are usually designed for homogeneous data; however, some provide a dot product instruction[5] and additional permutes, making the AoS case easier to handle.
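
A minimal sketch of the AoS case using the SSE4.1 dot product intrinsic referenced above (the helper dot4 is illustrative):

#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* With AoS 4D vectors, one register holds one whole vector, and
   _mm_dp_ps computes the dot product across its lanes. The mask 0xF1
   multiplies all four lane pairs and writes the sum to lane 0. */
float dot4(const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
}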

Although most GPU hardware has moved away from 4D instructions to scalar SIMT pipelines,[6] modern compute kernels using SoA instead of AoS can still give better performance due to memory coalescing.[7]

Software support

Most languages support the AoS format more naturally by combining records and various array abstract data types.

SoA is mostly found in languages, libraries, or metaprogramming tools used to support a data-oriented design.

Automated creation of AoSoA is more complex. An example of AoSoA in metaprogramming is found in LANL's Cabana library, written in C++; it assumes a vector width of 16 lanes by default.[8]

References

  1. "How to Manipulate Data Structure to Optimize Memory Use". Intel. 2012-02-09. Retrieved 2019-03-17.
  2. "Memory Layout Transformations". Intel. 2019-03-26. Retrieved 2019-06-02.
  3. "Kernel Profiling Guide" (PDF). NVIDIA. 2022-12-01. Retrieved 2022-01-14.)
  4. Fei, Yun (Raymond); Huang, Yuhan; Gao, Ming (2021), "Principles towards Real-Time Simulation of Material Point Method on Modern GPUs", pp. 1–16, arXiv: 2111.00699 [cs.GR]
  5. "Intel SSE4 Floating Point Dot Product Intrinsics". Intel. Archived from the original on 2016-06-24. Retrieved 2019-03-17.
  6. "Modern GPU Architecture (See Scalar Unified Pipelines)" (PDF). NVIDIA. Archived from the original (PDF) on 2018-05-17. Retrieved 2019-03-17.
  7. Kim, Hyesoon (2010-02-08). "CUDA Optimization Strategies" (PDF). CS4803 Design Game Consoles. Retrieved 2019-03-17.
  8. "ECP-copa/Cabana: AoSoA". GitHub.