Permute instruction

Last updated

Permute (and Shuffle) instructions, part of bit manipulation as well as vector processing, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array. [1] The size (bitwidth) of the source elements is not restricted but remains the same as the destination size.

Contents

There exists two important permute variants, known as gather and scatter, respectively. The gather variant is as follows:

fori=0tolength-1dest[i]=src[indices[i]]

where the scatter variant is:

fori=0tolength-1dest[indices[i]]=src[i]

Note that unlike in memory-based gather-scatter all three of dest, src, and indices are registers (or parts of registers in the case of bit-level permute), not memory locations.

The scatter variant can be seen to "scatter" the source elements across (into) to the destination, where the "gather" variant is gathering data from the indexed source elements.

Given that the indices may be repeated in both variants, the resultant output is not a strict mathematical permutation because duplicates can occur in the output.

A special case of permute is also used in GPU "swizzling" (again, not strictly a permutation) which performs on-the-fly reordering of subvector data so as to align or duplicate elements with the appropriate SIMD lane.

Occurrences of permute instructions

Permute instructions occur in both scalar processors as well as vector processing engines as well as GPUs. In vector instruction sets they are typically named "Register Gather/Scatter" operations such as in RISC-V vectors, [2] and take Vectors as input for both source elements and source array, and output another Vector.

In scalar instruction sets the scalar registers are broken down into smaller sections (unpacked, SIMD style) where the fragments are then used as array sources. The (small, partial) results are then concatenated (packed) back into the scalar register as the result.

Some ISAs, particularly for cryptographic applications, even have bit-level permute operations, such as bdep (bit deposit) in RISC-V bitmanip; [3] in the Power ISA it is known as bpermd and has been included for several decades, and is still in the Power ISA v.3.0 B spec. [4]

Also in some non-vector ISAs, due to there sometimes being insufficient space in the one source input register to specify the permutation source array in full (particularly if the operation involves bit-level permutation), will include partial reordering instructions. Examples include VSHUFF32x4 from AVX-512.

Permute operations in different forms are surprisingly common, occurring in AltiVec, Power ISA, PowerPC G4, AVX-512, SVE2, [5] vector processors, and GPUs. They are sufficiently important that LLVM added the shufflevector [6] intrinsic and GCC added the __builtin_shuffle intrinsic. [7] GCC's intrinsic matches the functionality of OpenCL's shuffle intrinsics. [8] Note that all of these, mathematically, are not permutations because duplicates can occur in the output.

See also

Related Research Articles

GNU Compiler Collection Free and open-source compiler for various programming languages

The GNU Compiler Collection (GCC) is an optimizing compiler produced by the GNU Project supporting various programming languages, hardware architectures and operating systems. The Free Software Foundation (FSF) distributes GCC as free software under the GNU General Public License. GCC is a key component of the GNU toolchain and the standard compiler for most projects related to GNU and the Linux kernel. With roughly 15 million lines of code in 2019, GCC is one of the biggest free programs in existence. It has played an important role in the growth of free software, as both a tool and an example.

Single instruction, multiple data Type of parallel processing

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

AltiVec is a single-precision floating point and integer SIMD instruction set designed and owned by Apple, IBM, and Freescale Semiconductor — the AIM alliance. It is implemented on versions of the PowerPC processor architecture, including Motorola's G4, IBM's G5 and POWER6 processors, and P.A. Semi's PWRficient PA6T. AltiVec is a trademark owned solely by Freescale, so the system is also referred to as Velocity Engine by Apple and VMX by IBM and P.A. Semi.

In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of Central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.

In computer science, predication is an architectural feature that provides an alternative to conditional transfer of control, as implemented by conditional branch machine instructions. Predication works by having conditional (predicated) non-branch instructions associated with a predicate, a Boolean value used by the instruction to control whether the instruction is allowed to modify the architectural state or not. If the predicate specified in the instruction is true, the instruction modifies the architectural state; otherwise, the architectural state is unchanged. For example, a predicated move instruction will only modify the destination if the predicate is true. Thus, instead of using a conditional branch to select an instruction or a sequence of instructions to execute based on the predicate that controls whether the branch occurs, the instructions to be executed are associated with that predicate, so that they will be executed, or not executed, based on whether that predicate is true or false.

OpenRISC is a project to develop a series of open-source hardware based central processing units (CPUs) on established reduced instruction set computer (RISC) principles. It includes an instruction set architecture (ISA) using an open-source license. It is the original flagship project of the OpenCores community.

LLVM Compiler backend for multiple programming languages

LLVM is a set of compiler and toolchain technologies that can be used to develop a front end for any programming language and a back end for any instruction set architecture. LLVM is designed around a language-independent intermediate representation (IR) that serves as a portable, high-level assembly language that can be optimized with a variety of transformations over multiple passes.

In computer software, in compiler theory, an intrinsic function is a function (subroutine) available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it may substitute a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function, the compiler has an intimate knowledge of an intrinsic function and can thus better integrate and optimize it for a given situation.

Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD). They were proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.

The XOP instruction set, announced by AMD on May 1, 2009, is an extension to the 128-bit SSE core instructions in the x86 and AMD64 instruction set for the Bulldozer processor core, which was released on October 12, 2011. However AMD removed support for XOP from Zen (microarchitecture) onward.

The FMA instruction set is an extension to the 128 and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set to perform fused multiply–add (FMA) operations. There are two variants:

Parallel Thread Execution is a low-level parallel thread execution virtual machine and instruction set architecture used in Nvidia's CUDA programming environment. The NVCC compiler translates code written in CUDA, a C++-like language, into PTX instructions, and the graphics driver contains a compiler which translates the PTX instructions into the executable binary code which can be run on the processing cores of Nvidia GPUs. The GNU Compiler Collection also has basic ability for PTX generation in the context of OpenMP offloading. Inline PTX assembly can be used in CUDA.

AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and implemented in Intel's Xeon Phi x200 and Skylake-X CPUs; this includes the Core-X series, as well as the new Xeon Scalable Processor Family and Xeon D-2100 Embedded Series.

AArch64 64-bit extension of the ARM architecture

AArch64 or ARM64 is the 64-bit extension of the ARM architecture.

RISC-V is an open standard instruction set architecture (ISA) based on established RISC principles. Unlike most other ISA designs, RISC-V is provided under open source licenses that do not require fees to use. A number of companies are offering or have announced RISC-V hardware, open source operating systems with RISC-V support are available, and the instruction set is supported in several popular software toolchains.

In computing, array of structures (AoS), structure of arrays (SoA) and array of structures of arrays (AoSoA) refer to contrasting ways to arrange a sequence of records in memory, with regard to interleaving, and are of interest in SIMD and SIMT programming.

In computer science, a 4D vector is a 4-component vector data type. Uses include homogeneous coordinates for 3-dimensional space in computer graphics, and red green blue alpha (RGBA) values for bitmap images with a color and alpha channel. They may also represent quaternions although the algebra they define is different.

The Pixel Visual Core (PVC) is a series of ARM-based system in package (SiP) image processors designed by Google. The PVC is a fully programmable image, vision and AI multi-core domain-specific architecture (DSA) for mobile devices and in future for IoT. It first appeared in the Google Pixel 2 and 2 XL which were introduced on October 19, 2017. It has also appeared in the Google Pixel 3 and 3 XL. Starting with the Pixel 4, this chip was replaced with the Pixel Neural Core.

Libre-SOC is a libre soft processor core originally written by Luke Leighton and other contributors, announced at the OpenPOWER Summit NA 2020. It adheres to the Power ISA 3.0 instruction set and can be run on FPGA boards, currently booting MicroPython and other bare-metal applications.

References

  1. Intel® 64 and IA-32 architectures software developer’s manual combined volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4 (PDF). Intel. June 2021. p. 5-356 Vol. 2C.
  2. "RISC-V "V" Vector Extension – 16.4. Vector Register Gather Instructions". GitHub – riscv/riscv-v-spec. Retrieved 2021-07-10.
  3. "riscv/riscv-bitmanip". GitHub. Retrieved 2021-07-10.
  4. "Power ISA Version 3.0 B". Power.org. 2017-03-27. Retrieved 2019-08-11.
  5. ARM HPC, SVE2 Extension summary, p32
  6. "LLVM 13 documentation: shufflevector". LLVM Language Reference Manual. Retrieved 2021-07-10.
  7. "Vector Extensions (Using the GNU Compiler Collection (GCC))". GCC, the GNU Compiler Collection - GNU Project - Free Software Foundation (FSF). Retrieved 2021-07-10.
  8. "OpenCL Specification: shuffle, shuffle2". The Khronos Group Inc. Retrieved 2021-07-10.