Last updated

SIMD within a register (SWAR) is a technique for performing parallel operations on data contained in a processor register. SIMD stands for single instruction, multiple data.


Many modern general-purpose computer processors have some provisions for SIMD, in the form of a group of registers and instructions to make use of them. SWAR refers to the use of those registers and instructions, as opposed to using specialized processing engines designed to be better at SIMD operations. It also refers to the use of SIMD with general-purpose registers and instructions that were not meant to do it at the time, by way of various novel software tricks. [1]

SWAR architectures

A SWAR architecture is one that includes instructions explicitly intended to perform parallel operations across data that is stored in the independent subwords or fields of a register. A SWAR-capable architecture is one that includes a set of instructions that is sufficient to allow data stored in these fields to be treated independently even though the architecture does not include instructions that are explicitly intended for that purpose.

An early example of a SWAR architecture was the Intel Pentium with MMX, which implemented the MMX extension set. The Intel Pentium, by contrast, did not include such instructions, but could still act as a SWAR architecture through careful hand-coding or compiler techniques.

Early SWAR architectures include DEC Alpha MVI, Hewlett-Packard's PA-RISC MAX, Silicon Graphics Incorporated's MIPS MDMX, and Sun's SPARC V9 VIS. Like MMX, many of the SWAR instruction sets are intended for faster video coding. [2]

History of the SWAR programming model

Wesley A. Clark introduced partitioned subword data operations in the 1950s. This can be seen as a very early predecessor to SWAR.

With the introduction of Intel's MMX multimedia instruction set extensions in 1996, desktop processors with SIMD parallel processing capabilities became common. Early on, these instructions could only be used via hand-written assembly code.

In the fall of 1996, Professor Hank Dietz was the instructor for the undergraduate Compiler Construction course at Purdue University's School of Electrical and Computer Engineering. For this course, he assigned a series of projects in which the students would build a simple compiler targeting MMX. The input language was a subset dialect of MasPar's MPL called NEMPL (Not Exactly MPL).

During the course of the semester, it became clear to the course teaching assistant, Randall (Randy) Fisher, that there were a number of issues with MMX that would make it difficult to build the back-end of the NEMPL compiler. For example, MMX has an instruction for multiplying 16-bit data but not multiplying 8-bit data. The NEMPL language did not account for this problem, allowing the programmer to write programs that required 8-bit multiplies.

Intel's x86 architecture was not the only architecture to include SIMD-like parallel instructions. Sun's VIS, SGI's MDMX, and other multimedia instruction sets had been added to other manufacturers' existing instruction set architectures to support so-called new media applications. These extensions had significant differences in the precision of data and types of instructions supported.

Dietz and Fisher began developing the idea of a well-defined parallel programming model that would allow the programming to target the model without knowing the specifics of the target architecture. This model would become the basis of Fisher's dissertation. The acronym "SWAR" was coined by Dietz and Fisher one day in Hank's office in the MSEE building at Purdue University. [3] It refers to this form of parallel processing, architectures that are designed to natively perform this type of processing, and the general-purpose programming model that is Fisher's dissertation.

The problem of compiling for these widely varying architectures was discussed in a paper presented at LCPC98. [2]

Some applications of SWAR

SWAR processing has been used in image processing, [4] cryptographic pairings, [5] raster processing. [6] Computational Fluid Dynamics, [7] and communications. [8]

See also

Related Research Articles

P5 (microarchitecture) Intel microporocessor

The first Pentium microprocessor was introduced by Intel on March 22, 1993. Its P5 microarchitecture was the fifth generation for Intel, and the first superscalar IA-32 microarchitecture. As a direct extension of the 80486 architecture, it included dual integer pipelines, a faster floating-point unit, wider data bus, separate code and data caches and features for further reduced address calculation latency. In October 1996, the Pentium with MMX Technology was introduced, complementing the same basic microarchitecture with the MMX instruction set, larger caches, and some other enhancements.

x86 Family of instruction set architectures

x86 is a family of instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors.

SIMD class of parallel computers in Flynns taxonomy, with multiple processing elements that perform the same operation on multiple data points simultaneously

Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines exploit data level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment. SIMD is particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio. Most modern CPU designs include SIMD instructions to improve the performance of multimedia use. SIMD is not to be confused with SIMT, which utilizes threads.

MMX (instruction set) instruction set designed by Intel

MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in January 1997 with its P5-based Pentium line of microprocessors, designated as "Pentium with MMX Technology". It developed out of a similar unit introduced on the Intel i860, and earlier the Intel i750 video pixel processor. MMX is a processor supplementary capability that is supported on recent IA-32 processors by Intel and other vendors.

In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of Central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

Visual Instruction Set, or VIS, is a SIMD instruction set extension for SPARC V9 microprocessors developed by Sun Microsystems. There are five versions of VIS: VIS 1, VIS 2, VIS 2+, VIS 3 and VIS 4.

Pentium II family of Intel microprocessors

The Pentium II brand refers to Intel's sixth-generation microarchitecture ("P6") and x86-compatible microprocessors introduced on May 7, 1997. Containing 7.5 million transistors, the Pentium II featured an improved version of the first P6-generation core of the Pentium Pro, which contained 5.5 million transistors. However, its L2 cache subsystem was a downgrade when compared to the Pentium Pro's.

Pentium III Line of desktop and mobile microprocessors produced by Intel

The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture introduced on February 26, 1999. The brand's initial processors were very similar to the earlier Pentium II-branded microprocessors. The most notable differences were the addition of the SSE instruction set, and the introduction of a controversial serial number embedded in the chip during the manufacturing process.

3DNow! is an extension to the x86 instruction set developed by Advanced Micro Devices (AMD). It adds single instruction multiple data (SIMD) instructions to the base x86 instruction set, enabling it to perform vector processing, which improves the performance of many graphic-intensive applications. The first microprocessor to implement 3DNow was the AMD K6-2, which was introduced in 1998. When the application was appropriate this raised the speed by about 2–4 times.

x86 Assembly Language is a family of backward-compatible assembly languages, which provide some level of compatibility all the way back to the Intel 8008 introduced in April 1972. x86 assembly languages are used to produce object code for the x86 class of processors. Like all assembly languages, it uses short mnemonics to represent the fundamental instructions that the CPU in a computer can understand and follow. Compilers sometimes produce assembly code as an intermediate step when translating a high level program into machine code. Regarded as a programming language, assembly coding is machine-specific and low level. Assembly languages are more typically used for detailed and time critical applications such as small real-time embedded systems or operating system kernels and device drivers.

AMD K6-2 family of microprocessors introduced in 1998

The K6-2 is an x86 microprocessor introduced by AMD on May 28, 1998, and available in speeds ranging from 266 to 550 MHz. An enhancement of the original K6, the K6-2 introduced AMD's 3DNow! SIMD instruction set, featured a larger 64 KiB Level 1 cache, and an upgraded system-bus interface called Super Socket 7, which was backward compatible with older Socket 7 motherboards. It was manufactured using a 0.25 micrometre process, ran at 2.2 volts, and had 9.3 million transistors.

SSE2 is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2000. It extends the earlier SSE instruction set, and is intended to fully replace MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.

SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU. In April 2005, AMD introduced a subset of SSE3 in revision E of their Athlon 64 CPUs. The earlier SIMD instruction sets on the x86 platform, from oldest to newest, are MMX, 3DNow!, SSE, and SSE2.

The x86 instruction set refers to the set of instructions that x86-compatible microprocessors support. The instructions are usually part of an executable program, often stored as a computer file and executed on the processor.

x87 is a floating-point-related subset of the x86 architecture instruction set. It originated as an extension of the 8086 instruction set in the form of optional floating-point coprocessors that worked in tandem with corresponding x86 CPUs. These microchips had names ending in "87". This was also known as the NPX. Like other extensions to the basic instruction set, x87 instructions are not strictly needed to construct working programs, but provide hardware and microcode implementations of common numerical tasks, allowing these tasks to be performed much faster than corresponding machine code routines can. The x87 instruction set includes instructions for basic floating-point operations such as addition, subtraction and comparison, but also for more complex numerical operations, such as the computation of the tangent function and its inverse, for example.

Supplemental Streaming SIMD Extensions 3 is a SIMD instruction set created by Intel and is the fourth iteration of the SSE technology.

Intel C++ Compiler, also known as icc or icl, is a group of C and C++ compilers from Intel available for Windows, Mac, Linux, FreeBSD and Intel-based Android devices.

Advanced Vector Extensions are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.

Extended MMX refers to one of two possible extensions to the MMX instruction set for x86.

Open Watcom Assembler or WASM is an x86 assembler produced by Watcom, based on the Watcom Assembler found in Watcom C/C++ compiler and Watcom FORTRAN 77. Further development is being done on the 32- and 64-bit JWASM project,. which more closely matches the syntax of Microsoft's assembler.


  1. Fisher, Randall J (2003). General-Purpose SIMD Within A Register: Parallel Processing on Consumer Microprocessors (PDF) (Ph.D.). Purdue University.
  2. 1 2 Fisher, Randall J.; Henry G. Dietz (August 1998). S. Chatterjee; J. F. Prins; L. Carter; J. Ferrante; Z. Li; D. Sehr; P.-C.Yew (eds.). "Compiling for SIMD Within A Register". Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing.
  3. Dietz, Hank. "The Aggregate Magic Algorithms".
  4. Padua, Flavio L. C.; Pereira, Guilherme A. S.; Neto, Jose P. de Queiroz; Campos, Mario F. M.; Fernandes, Antonio O. (2001). "Improving processing time of large images by instruction level parallelism" (PDF).Cite journal requires |journal= (help)
  5. Grabher, Philipp; Johann Großschädl; Dan Page (2009). On Software Parallel Implementation of Cryptographic Pairings. Selected Areas in Cryptography. Lecture Notes in Computer Science. 5381. pp. 35–50. doi:10.1007/978-3-642-04159-4_3. ISBN   978-3-642-04158-7.
  6. Persada, Onil Nazra; Thierry Goubier (12–14 September 2004). "Accelerating Raster Processing with Fine and Coarse Grain Parallelism in GRASS". Proceedings of the FOSS/GRASS Users Conference 2004.
  7. Hauser, Thomas; T. I. Mattox; R. P. LeBeau; H. G. Dietz; P. G. Huang (April 2003). "Code Optimizations for Complex Microprocessors Applied to CFD Software". SIAM Journal on Scientific Computing. 25 (4): 1461–1477. doi:10.1137/S1064827502410530. ISSN   1064-8275.
  8. Spracklen, Lawrence A. (2001). SWAR Systems and Communications Applications (PDF) (Ph.D.). University of Aberdeen.