SIMD within a register (SWAR), also known by the name "packed SIMD", [1] is a technique for performing parallel operations on data contained in a processor register. SIMD stands for single instruction, multiple data. Flynn's 1972 taxonomy categorises SWAR as "pipelined processing".
Many modern general-purpose computer processors have some provisions for SIMD, in the form of a group of registers and instructions to make use of them. SWAR refers to the use of those registers and instructions, as opposed to using specialized processing engines designed to be better at SIMD operations. It also refers to the use of SIMD with general-purpose registers and instructions that were not meant to do it at the time, by way of various novel software tricks. [3]
A SWAR architecture is one that includes instructions explicitly intended to perform parallel operations across data that is stored in the independent subwords or fields of a register. A SWAR-capable architecture is one that includes a set of instructions that is sufficient to allow data stored in these fields to be treated independently even though the architecture does not include instructions that are explicitly intended for that purpose.
An early example of a SWAR architecture was the Intel Pentium MMX, which implemented the MMX extension set. The original Intel Pentium, by contrast, did not include such instructions, but could still act as a SWAR-capable architecture through careful hand-coding or compiler techniques.
Early SWAR architectures include DEC Alpha MVI, Hewlett-Packard's PA-RISC MAX, Silicon Graphics Incorporated's MIPS MDMX, and Sun's SPARC V9 VIS. Like MMX, many of the SWAR instruction sets are intended for faster video coding. [4]
Wesley A. Clark introduced partitioned subword data operations in the 1950s,[citation needed] which can be seen as a very early predecessor to SWAR. Leslie Lamport presented SWAR techniques in his 1975 paper "Multiple byte processing with full-word instructions". [5]
With the introduction of Intel's MMX multimedia instruction set extensions in 1996, desktop processors with SIMD parallel processing capabilities became common. Early on, these instructions could only be used via hand-written assembly code.
In the fall of 1996, Professor Hank Dietz was the instructor for the undergraduate Compiler Construction course at Purdue University's School of Electrical and Computer Engineering. For this course, he assigned a series of projects in which the students would build a simple compiler targeting MMX. The input language was a subset dialect of MasPar's MPL called NEMPL (Not Exactly MPL).
During the course of the semester, it became clear to the course teaching assistant, Randall (Randy) Fisher, that there were a number of issues with MMX that would make it difficult to build the back-end of the NEMPL compiler. For example, MMX has an instruction for multiplying 16-bit data but not multiplying 8-bit data. The NEMPL language did not account for this problem, allowing the programmer to write programs that required 8-bit multiplies.
Intel's x86 architecture was not the only architecture to include SIMD-like parallel instructions. Sun's VIS, SGI's MDMX, and other multimedia instruction sets had been added to other manufacturers' existing instruction set architectures to support so-called new media applications. These extensions had significant differences in the precision of data and types of instructions supported.
Dietz and Fisher began developing the idea of a well-defined parallel programming model that would allow the programmer to target the model without knowing the specifics of the target architecture. This model would become the basis of Fisher's dissertation. The acronym "SWAR" was coined by Dietz and Fisher one day in Hank's office in the MSEE building at Purdue University. [6] It refers to this form of parallel processing, architectures that are designed to natively perform this type of processing, and the general-purpose programming model that is Fisher's dissertation.
The problem of compiling for these widely varying architectures was discussed in a paper presented at LCPC98. [4]
SWAR processing has been used in image processing, [7] cryptographic pairings, [8] raster processing, [9] computational fluid dynamics, [10] and communications. [11]
SWAR techniques can be used even on systems without special hardware support. Logical operations act bitwise, so act on each bit of a register independently. Using addition and subtraction is more difficult, but can be useful if care is taken to avoid unwanted carry propagation between lanes. Except for this carry propagation, one 64-bit addition or subtraction is the same as performing eight 8-bit additions or subtractions.
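As an illustration (a minimal sketch in C, assuming a 64-bit machine; the function name paddb_swar is ours), eight independent 8-bit additions can be performed with a single 64-bit add by excluding the high bit of each lane from the addition and folding it back in afterwards, so that no carry can cross a lane boundary:

```c
#include <stdint.h>

/* Add eight packed 8-bit lanes using one 64-bit addition.
 * The high bit of each lane is removed before the add, so a carry
 * out of one lane's low 7 bits cannot propagate into the next lane;
 * the high bits are then recombined with XOR (addition modulo 2). */
uint64_t paddb_swar(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL;   /* high bit of each lane */
    uint64_t low = (a & ~H) + (b & ~H);         /* carries stop at bit 7 */
    return low ^ ((a ^ b) & H);                 /* fold the high bits in */
}
```

Each lane wraps modulo 256 independently, exactly as eight separate 8-bit additions would.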
Probably the archetypal example of SWAR techniques is finding the population count (the number of bits set) of a register. The register is treated successively as a series of 1-bit, 2-bit, 4-bit, etc. fields.
To begin with, note that the population count of a 1-bit field is simply the field itself. To find the population count of a 2-bit field, sum the population counts of its two constituent 1-bit fields. This can be done in parallel for 32 2-bit fields in a 64-bit value x:
x2 := (x & 0x5555555555555555) + (x >> 1 & 0x5555555555555555);
Each hexadecimal digit 5 of the constant is binary 0101, which isolates the even-numbered bits. The addition cannot overflow out of any 2-bit field, as the maximum possible sum is 2.
This can be repeated to combine 2-bit fields into 4-bit fields. Here, we use a mask of binary 0011 (hexadecimal digit 3) to isolate pairs of bits:
x4 := (x2 & 0x3333333333333333) + (x2 >> 2 & 0x3333333333333333);
Now each 4-bit field contains a count from 0 to 4. Because a 4-bit field can contain a value up to 15, no overflow is possible when adding two 4-bit population counts, allowing the masking to be done after the addition, rather than once per addend:
x8 := (x4 + (x4 >> 4)) & 0x0f0f0f0f0f0f0f0f;
At this point, the 8-bit fields can hold values up to 255, so there is no need for further masking until the very end:
x16 := x8 + (x8 >> 8);
x32 := x16 + (x16 >> 16);
x64 := x32 + (x32 >> 32);
population_count := x64 & 0xff;
There are several well-known variants on this. In particular, the last three shift-and-add steps can be combined into
population_count := x8 * 0x0101010101010101 >> 56;
The three shift-and-add stages comprise six instructions, each with a data dependency on the previous one, so they take at least six clock cycles. A single multiplication can usually be performed faster. When operating on 32-bit words, the advantage is less clear, as a 3-cycle multiply is common.
A second variant is a change to the first step. Rather than combining the two bits b1 and b0 in each 2-bit field by adding them, consider the initial value of the 2-bit field as 2b1 + b0. Subtracting b1 from this will produce the desired sum, with only one masking operation:
x2 := x - (x >> 1 & 0x5555555555555555);
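Collecting the steps above (the subtraction variant for the first stage and the multiply for the final reduction) gives the following sketch in C; the function name popcount64 is ours:

```c
#include <stdint.h>

/* Population count assembled from the steps described above:
 * the subtraction variant for the 2-bit stage, masked adds for the
 * 4-bit and 8-bit stages, and a multiply to sum the byte counts. */
unsigned popcount64(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555ULL;                 /* 2-bit counts */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);                /* 4-bit counts */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;            /* 8-bit counts */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);  /* sum of bytes */
}
```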
It is common to search a character string for a null terminator. Doing this one byte at a time is inefficient when a 64-bit processor can operate on 8 bytes at a time.
The same technique can be used to search for pathname separators or other delimiters, by exclusive-oring with the target byte value first.
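For example (a sketch in C; the function name match_byte_map is ours), a search for '/' path separators can XOR the word with the target byte replicated into every position, turning matching bytes into zero bytes, and then apply a zero-byte test. This sketch uses the subtraction-based test described later in this section, so only the least significant flagged byte is guaranteed correct:

```c
#include <stdint.h>

/* XOR with the replicated target turns each matching byte into zero;
 * the subtraction-based zero-byte test then flags it. Because of
 * borrow propagation, only the least significant flagged byte is
 * guaranteed to be a true match. */
uint64_t match_byte_map(uint64_t x, unsigned char target)
{
    uint64_t t = (uint64_t)target * 0x0101010101010101ULL; /* replicate  */
    uint64_t v = x ^ t;                                    /* match -> 0 */
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}
```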
Some architectures include special instructions for comparing 8 bytes at once. For example, the DEC Alpha, an early 64-bit microprocessor, included a CMPBGE instruction for performing 8 byte compares in a single operation. However, searching for a zero byte can be done without any special support.
One way would be to OR together the bits of each byte, in a manner much like the bit-counting example above:
x2 := x | x << 1;
x4 := x2 | x2 << 2;
x8 := x4 | x4 << 4;
byte_map := ~x8 & 0x8080808080808080;
This results in a byte_map with a 1 in the most significant bit of any byte that was originally zero.
However, this can be done more quickly by taking advantage of carry propagation using arithmetic operations. Adding 0x7f (binary 01111111) to each byte causes a carry into bit 7 if the low 7 bits are non-zero. The challenge is to ensure that the carry propagation stops at bit 7 and does not affect other bytes. This can be done by working on the low 7 bits and the high bit of each byte separately. First, extract the low 7 bits of each byte by ANDing with 0x7f before adding 0x7f:
x7 := (x & 0x7f7f7f7f7f7f7f7f) + 0x7f7f7f7f7f7f7f7f;
Then combine with the most-significant bits:
x8 := x7 | x;
This value will have the msbit of each 8-bit field set to 1 if that byte is non-zero. Finally:
byte_map := ~(x8 | 0x7f7f7f7f7f7f7f7f);
will set all the unwanted low bits in each byte, then complement everything, leaving 1 bits only where the corresponding input byte is zero. (This is equivalent to ~x8 & 0x80...80, but reuses the same constant value.) If there are no 1 bits, the search can continue with the following word. If there are any 1 bits, the length of the string can be computed from their positions.
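The masking steps above can be collected into a small C function (a sketch; the name zero_byte_map is ours):

```c
#include <stdint.h>

/* Returns a word with bit 7 set in each byte position where the
 * corresponding byte of x is zero, using the carry trick above. */
uint64_t zero_byte_map(uint64_t x)
{
    const uint64_t L7 = 0x7f7f7f7f7f7f7f7fULL;
    uint64_t x7 = (x & L7) + L7;  /* bit 7 set where low 7 bits non-zero */
    uint64_t x8 = x7 | x;         /* bit 7 set where the byte is non-zero */
    return ~(x8 | L7);            /* keep only bit 7 of each zero byte */
}
```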
If the goal is limited to finding the first zero byte on a little-endian processor, it is possible to find the least significant zero byte in fewer operations, using two different constants: [12]
x7 := x - 0x0101010101010101;
byte_map := x7 & ~x & 0x8080808080808080;
For each byte b, this sets its msbit in byte_map if the msbit of b − 1 is set and the msbit of b is clear, something that only happens if b = 0.
The preceding statement is only true if there is no borrow in; if there is a borrow, the condition will also be true if b = 1. However, such a borrow can only be generated by a less significant zero byte, so the least significant zero byte will be correctly identified, as desired.
Not only does this save one binary operation, but the operations are not all sequentially dependent, so the test can complete in two cycles, assuming the existence of an "and not" (bit clear) instruction.
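A sketch in C of this search (the function name first_zero_byte is ours; a simple loop extracts the byte index from the map, though architectures with a count-trailing-zeros instruction can do this directly):

```c
#include <stdint.h>

/* Little-endian search for the least significant zero byte of x,
 * using the two-constant trick above. Higher bytes may be flagged
 * spuriously by borrow propagation, but the lowest flagged byte is
 * always correct. Returns 8 if no byte is zero. */
unsigned first_zero_byte(uint64_t x)
{
    const uint64_t ONES = 0x0101010101010101ULL;
    const uint64_t HIGH = 0x8080808080808080ULL;
    uint64_t map = (x - ONES) & ~x & HIGH;
    for (unsigned i = 0; i < 8; i++)      /* index of lowest flagged byte */
        if (map & (0x80ULL << (8 * i)))
            return i;
    return 8;                             /* no zero byte in this word */
}
```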
As a generalization of a bitmap, it is possible to store very small lookup tables in a single register. For example, the number of days in a month varies from 28 to 31, a range of 4 values. This can be stored in 12×2 = 24 bits:
days_table := 0xeefbb3 + (is_leap_year << 2);
days_in_month := 28 + (days_table >> 2*month & 3);
(This assumes a 0-based month number; a 1-based month number can be accommodated by shifting days_table.)
The fact that the table fits neatly into one register makes it easy to modify for leap years.
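The lookup can be written as follows (a sketch in C; the function name days_in_month is ours, with 0-based months as above):

```c
#include <stdint.h>

/* Days in each month, packed as 2-bit offsets from 28, month 0 = January.
 * Adding is_leap_year << 2 increments February's field from 0 to 1,
 * changing its result from 28 to 29 days. */
unsigned days_in_month(unsigned month, int is_leap_year)
{
    uint32_t days_table = 0xeefbb3 + ((uint32_t)is_leap_year << 2);
    return 28 + ((days_table >> (2 * month)) & 3);
}
```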
x86 is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel, based on the 8086 microprocessor and its 8-bit-external-bus variant, the 8088. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486. Colloquially, their names were "186", "286", "386" and "486".
In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, such as a central processing unit (CPU), is called an implementation of that ISA.
Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.
AltiVec is a single-precision floating point and integer SIMD instruction set designed and owned by Apple, IBM, and Freescale Semiconductor — the AIM alliance. It is implemented on versions of the PowerPC processor architecture, including Motorola's G4, IBM's G5 and POWER6 processors, and P.A. Semi's PWRficient PA6T. AltiVec is a trademark owned solely by Freescale, so the system is also referred to as Velocity Engine by Apple and VMX by IBM and P.A. Semi.
MMX is a single instruction, multiple data (SIMD) instruction set architecture designed by Intel, introduced on January 8, 1997 with its P5 microarchitecture-based line of Pentium microprocessors, named "Pentium with MMX Technology". It developed out of a similar unit introduced on the Intel i860, and earlier the Intel i750 video pixel processor. MMX is a processor supplementary capability that has been supported on IA-32 processors by Intel and other vendors since 1997. AMD also added the MMX instruction set to its K6 processor.
In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in its Pentium III series of central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SIMD within a register (SWAR) arithmetic units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.
3DNow! is a deprecated extension to the x86 instruction set developed by Advanced Micro Devices (AMD). It adds single instruction multiple data (SIMD) instructions to the base x86 instruction set, enabling it to perform vector processing of floating-point vector operations using vector registers. This improvement enhances the performance of many graphics-intensive applications. The first microprocessor to implement 3DNow! was the AMD K6-2, introduced in 1998. In appropriate applications, this enhancement raised the speed by about 2–4 times.
x86 assembly language is a family of low-level programming languages that are used to produce object code for the x86 class of processors. These languages provide backward compatibility with CPUs dating back to the Intel 8008 microprocessor, introduced in April 1972. As assembly languages, they are closely tied to the architecture's machine code instructions, allowing for precise control over hardware.
x86-64 is a 64-bit extension of the x86 instruction set architecture first announced in 1999. It introduces two new operating modes: 64-bit mode and compatibility mode, along with a new four-level paging mechanism.
SSE2 is one of the Intel SIMD processor supplementary instruction sets introduced by Intel with the initial version of the Pentium 4 in 2000. SSE2 instructions allow the use of XMM (SIMD) registers on x86 instruction set architecture processors. These registers can load up to 128 bits of data and perform instructions, such as vector addition and multiplication, simultaneously.
The AMD Family 10h, or K10, is a microprocessor microarchitecture by AMD based on the K8 microarchitecture. The first third-generation Opteron products for servers were launched on September 10, 2007, with the Phenom processors for desktops following and launching on November 11, 2007 as the immediate successors to the K8 series of processors.
Supplemental Streaming SIMD Extensions 3 is a SIMD instruction set created by Intel and is the fourth iteration of the SSE technology.
SSE4 is a SIMD CPU instruction set used in the Intel Core microarchitecture and AMD K10 (K8L). It was announced on September 27, 2006, at the Fall 2006 Intel Developer Forum, with vague details in a white paper; more precise details of 47 instructions became available at the Spring 2007 Intel Developer Forum in Beijing. SSE4 extended the SSE3 instruction set, which was released in early 2004. All software using previous Intel SIMD instructions continues to run correctly, without modification, on microprocessors that incorporate SSE4.
In numerical analysis, Estrin's scheme, also known as Estrin's method, is an algorithm for numerical evaluation of polynomials.
Advanced Vector Extensions are SIMD extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD). They were proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge microarchitecture shipping in Q1 2011 and later by AMD with the Bulldozer microarchitecture shipping in Q4 2011. AVX provides new features, new instructions, and a new coding scheme.
AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture (ISA), proposed by Intel in July 2013, first implemented in the 2016 Intel Xeon Phi x200, and later in a number of AMD and other Intel CPUs. AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F is required by all AVX-512 implementations.
The EVEX prefix and corresponding coding scheme is an extension to the 32-bit x86 (IA-32) and 64-bit x86-64 (AMD64) instruction set architecture. EVEX is based on, but should not be confused with, the MVEX prefix used by the Knights Corner processor.
The x86 instruction set has several times been extended with SIMD instruction set extensions. These extensions, starting from the MMX instruction set extension introduced with Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.