SWAR

Last updated February 23, 2024

SIMD within a register (SWAR), also known as "packed SIMD",^[1] is a technique for performing parallel operations on data contained in a processor register. SIMD stands for single instruction, multiple data. Flynn's 1972 taxonomy categorises SWAR as "pipelined processing".^[2]

Many modern general-purpose processors have some provision for SIMD, in the form of a group of registers and the instructions that operate on them. SWAR refers to the use of those registers and instructions, as opposed to specialized processing engines designed to be better at SIMD operations. It also refers to performing SIMD with general-purpose registers and instructions that were not designed for that purpose, by way of various novel software tricks.^[3]
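As a rough illustration of the kind of software trick meant here, the following C sketch adds eight unsigned 8-bit fields packed into one 64-bit general-purpose register using only ordinary AND, XOR, and ADD operations. The function names and the choice of 8-bit fields in a 64-bit word are assumptions made for this example and do not come from the cited sources.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting the top bit of each 8-bit field in a 64-bit word. */
#define SWAR_H8 UINT64_C(0x8080808080808080)

/* Add eight unsigned 8-bit fields at once; each field wraps modulo 256. */
uint64_t swar_add_u8(uint64_t x, uint64_t y)
{
    /* With the top bit of every field cleared, a carry out of bit 6
       lands in bit 7 of the same field and cannot reach the next one. */
    uint64_t low7 = (x & ~SWAR_H8) + (y & ~SWAR_H8);
    /* Bit 7 of each field is x7 XOR y7 XOR carry-in; XOR produces no
       carry-out, so neighbouring fields stay independent. */
    return low7 ^ ((x ^ y) & SWAR_H8);
}

/* Scalar reference (hypothetical helper): the same sum, one field at a time. */
uint64_t scalar_add_u8(uint64_t x, uint64_t y)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(x >> (8 * i));
        uint8_t b = (uint8_t)(y >> (8 * i));
        r |= (uint64_t)(uint8_t)(a + b) << (8 * i);
    }
    return r;
}

/* Small usage check: both versions must agree on a mixed input. */
void swar_add_u8_selfcheck(void)
{
    uint64_t x = UINT64_C(0x01FF7F80FE10A5C3);
    uint64_t y = UINT64_C(0x0101018002F05A3D);
    assert(swar_add_u8(x, y) == scalar_add_u8(x, y));
}
```

The masked addition costs a few extra logical operations per word, but it replaces eight separate byte additions with one full-width add.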

SWAR architectures

A SWAR architecture is one that includes instructions explicitly intended to perform parallel operations across data that is stored in the independent subwords or fields of a register. A SWAR-capable architecture is one that includes a set of instructions that is sufficient to allow data stored in these fields to be treated independently even though the architecture does not include instructions that are explicitly intended for that purpose.

An early example of a SWAR architecture was the Intel Pentium with MMX, which implemented the MMX extension set. The original Intel Pentium, by contrast, did not include such instructions, but could still act as a SWAR-capable architecture through careful hand-coding or compiler techniques.

Early SWAR architectures include DEC Alpha MVI, Hewlett-Packard's PA-RISC MAX, Silicon Graphics Incorporated's MIPS MDMX, and Sun's SPARC V9 VIS. Like MMX, many of the SWAR instruction sets are intended for faster video coding.^[4]

History of the SWAR programming model

Wesley A. Clark introduced partitioned subword data operations in the 1950s.^[citation needed] This can be seen as a very early predecessor to SWAR. Leslie Lamport presented SWAR techniques in his 1975 paper "Multiple byte processing with full-word instructions".^[5]
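To give a flavour of multiple-byte processing with full-word instructions, the C sketch below tests all eight bytes of a 64-bit word for a given value with a handful of word-wide operations. It is a widely known bit trick shown for illustration only; it is not taken from Lamport's paper, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LANE_LO UINT64_C(0x0101010101010101) /* 0x01 in every byte */
#define LANE_HI UINT64_C(0x8080808080808080) /* 0x80 in every byte */

/* Nonzero if and only if at least one byte of w is 0x00.
   Subtracting 1 from every byte sets a byte's top bit when that byte
   was zero (or already >= 0x81); AND-ing with ~w discards the bytes
   whose top bit was set to begin with. */
uint64_t has_zero_byte(uint64_t w)
{
    return (w - LANE_LO) & ~w & LANE_HI;
}

/* Nonzero if and only if at least one byte of w equals b.
   XOR with b replicated into every byte turns matches into zero bytes. */
uint64_t has_byte(uint64_t w, uint8_t b)
{
    return has_zero_byte(w ^ (LANE_LO * b));
}

/* Small usage check on a word whose bytes are all distinct and nonzero. */
void has_byte_selfcheck(void)
{
    uint64_t w = UINT64_C(0x1122334455667788);
    assert(has_zero_byte(w) == 0);
    assert(has_byte(w, 0x55) != 0);
    assert(has_byte(w, 0x99) == 0);
}
```

Only the overall zero or nonzero result is meaningful here; the individual bits of the return value are not guaranteed to mark exactly the matching bytes.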

With the introduction of Intel's MMX multimedia instruction set extensions in 1996, desktop processors with SIMD parallel processing capabilities became common. Early on, these instructions could only be used via hand-written assembly code.

In the fall of 1996, Professor Hank Dietz was the instructor for the undergraduate Compiler Construction course at Purdue University's School of Electrical and Computer Engineering. For this course, he assigned a series of projects in which the students would build a simple compiler targeting MMX. The input language was a subset dialect of MasPar's MPL called NEMPL (Not Exactly MPL).

During the semester, it became clear to the course teaching assistant, Randall (Randy) Fisher, that a number of issues with MMX would make it difficult to build the back-end of the NEMPL compiler. For example, MMX has an instruction for multiplying 16-bit data but none for multiplying 8-bit data. The NEMPL language did not account for this problem, allowing the programmer to write programs that required 8-bit multiplies.
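One way a compiler back-end could bridge such a gap is to synthesize the missing 8-bit multiply from the 16-bit one. The C sketch below shows the idea on a 64-bit word. It is an assumption-laden illustration, not the approach documented for the NEMPL compiler: `mul_lo_u16x4` is a hypothetical portable stand-in for a 16-bit field multiply that keeps the low half of each product (in the manner of the MMX PMULLW instruction), and `mul_lo_u8x8` is a hypothetical name for the synthesized operation.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for a hardware multiply on four 16-bit fields that
   keeps the low 16 bits of each product. Real hardware would do this in
   one instruction; the loop only models its result. */
uint64_t mul_lo_u16x4(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (uint16_t)(a >> (16 * i));
        uint32_t y = (uint16_t)(b >> (16 * i));
        r |= (uint64_t)((x * y) & 0xFFFFu) << (16 * i);
    }
    return r;
}

/* Multiply eight 8-bit fields (results modulo 256) using two 16-bit
   field multiplies: one for the even bytes, one for the odd bytes. */
uint64_t mul_lo_u8x8(uint64_t a, uint64_t b)
{
    const uint64_t even = UINT64_C(0x00FF00FF00FF00FF);
    /* Even bytes already sit in the low half of each 16-bit field. */
    uint64_t lo = mul_lo_u16x4(a & even, b & even) & even;
    /* Odd bytes are shifted down into the same position first. */
    uint64_t hi = mul_lo_u16x4((a >> 8) & even, (b >> 8) & even) & even;
    /* Keep the low 8 bits of each product and interleave the halves. */
    return lo | (hi << 8);
}

/* Small usage check: doubling every byte of a word. */
void mul_lo_u8x8_selfcheck(void)
{
    assert(mul_lo_u8x8(UINT64_C(0x0102030405060708),
                       UINT64_C(0x0202020202020202))
           == UINT64_C(0x020406080A0C0E10));
}
```

The synthesized operation needs two multiplies plus masking and shifting for every word, which illustrates why a programming model that hides such differences has to account for their cost.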

Intel's x86 architecture was not the only architecture to include SIMD-like parallel instructions. Sun's VIS, SGI's MDMX, and other multimedia instruction sets had been added to other manufacturers' existing instruction set architectures to support so-called new media applications. These extensions had significant differences in the precision of data and types of instructions supported.

Dietz and Fisher began developing the idea of a well-defined parallel programming model that would allow the programmer to target the model without knowing the specifics of the target architecture. This model would become the basis of Fisher's dissertation. The acronym "SWAR" was coined by Dietz and Fisher one day in Dietz's office in the MSEE building at Purdue University.^[6] It refers to this form of parallel processing, to architectures that are designed to natively perform this type of processing, and to the general-purpose programming model that is the subject of Fisher's dissertation.

The problem of compiling for these widely varying architectures was discussed in a paper presented at LCPC98.^[4]

Some applications of SWAR

SWAR processing has been used in image processing,^[7] cryptographic pairings,^[8] raster processing,^[9] computational fluid dynamics,^[10] and communications.^[11]

References

[1] Miyaoka, Y.; Choi, J.; Togawa, N.; Yanagisawa, M.; Ohtsuki, T. (2002). An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions. Asia-Pacific Conference on Circuits and Systems. Vol. 1. pp. 171–176. doi:10.1109/APCCAS.2002.1114930. hdl:2065/10689.
[2] Flynn, Michael J. (September 1972). "Some Computer Organizations and Their Effectiveness" (PDF). IEEE Transactions on Computers. C-21 (9): 948–960. doi:10.1109/TC.1972.5009071.
[3] Fisher, Randall J. (2003). General-Purpose SIMD Within A Register: Parallel Processing on Consumer Microprocessors (PDF) (Ph.D.). Purdue University.
[4] Fisher, Randall J.; Henry G. Dietz (August 1998). S. Chatterjee; J. F. Prins; L. Carter; J. Ferrante; Z. Li; D. Sehr; P.-C. Yew (eds.). "Compiling for SIMD Within A Register". Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing.
[5] Lamport, Leslie (August 1975). "Multiple byte processing with full-word instructions". Communications of the ACM. 18 (8): 471–475. doi:10.1145/360933.360994. S2CID 1593593.
[6] Dietz, Hank. "The Aggregate Magic Algorithms".
[7] Padua, Flavio L. C.; Pereira, Guilherme A. S.; Neto, Jose P. de Queiroz; Campos, Mario F. M.; Fernandes, Antonio O. (January 2001). Improving processing time of large images by instruction level parallelism (PDF). Chilean Computing Week, V Workshop on Parallel and Distributed Systems. Punta Arenas. Archived from the original (PDF) on 2007-02-25.
[8] Grabher, Philipp; Johann Großschädl; Dan Page (2009). "On Software Parallel Implementation of Cryptographic Pairings". Selected Areas in Cryptography. Lecture Notes in Computer Science. Vol. 5381. pp. 35–50. doi:10.1007/978-3-642-04159-4_3. ISBN 978-3-642-04158-7.
[9] Persada, Onil Nazra; Thierry Goubier (12–14 September 2004). "Accelerating Raster Processing with Fine and Coarse Grain Parallelism in GRASS". Proceedings of the FOSS/GRASS Users Conference 2004.
[10] Hauser, Thomas; T. I. Mattox; R. P. LeBeau; H. G. Dietz; P. G. Huang (April 2003). "Code Optimizations for Complex Microprocessors Applied to CFD Software". SIAM Journal on Scientific Computing. 25 (4): 1461–1477. doi:10.1137/S1064827502410530. ISSN 1064-8275.
[11] Spracklen, Lawrence A. (2001). SWAR Systems and Communications Applications (PDF) (Ph.D.). University of Aberdeen.

External links

The Aggregate - SWAR: SIMD Within A Register

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
