SSE3

Last updated

SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), [1] is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU. [1] In April 2005, AMD introduced a subset of SSE3 in revision E (Venice and San Diego) of their Athlon 64 CPUs. [2] The earlier SIMD instruction sets on the x86 platform, from oldest to newest, are MMX, 3DNow! (developed by AMD, no longer supported on newer CPUs), SSE, and SSE2.

Contents

SSE3 contains 13 new instructions over SSE2. [3]

Changes

The most notable change is the capability to work horizontally in a register, as opposed to the more or less strictly vertical operation of all previous SSE instructions. More specifically, instructions to add and subtract the multiple values stored within a single register have been added. [4] These instructions can be used to speed up the implementation of a number of DSP and 3D operations. There is also a new instruction to convert floating point values to integers without having to change the global rounding mode, thus avoiding costly pipeline stalls. Finally, the extension adds LDDQU, an alternative misaligned integer vector load that has better performance on NetBurst based platforms for loads that cross cacheline boundaries. [5]

CPUs with SSE3

New instructions

Common instructions

Arithmetic

ADDSUBPD
Add-Subtract-Packed-Double [8]
  • Input: { A0, A1 }, { B0, B1 }
  • Output: { A0 − B0, A1 + B1 }
ADDSUBPS
Add-Subtract-Packed-Single [8]
  • Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
  • Output: { A0 − B0, A1 + B1, A2 − B2, A3 + B3 }

AOS ( Array Of Structures )

HADDPD
Horizontal-Add-Packed-Double [8]
  • Input: { A0, A1 }, { B0, B1 }
  • Output: { A0 + A1, B0 + B1 }
HADDPS
Horizontal-Add-Packed-Single [8]
  • Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
  • Output: { A0 + A1, A2 + A3, B0 + B1, B2 + B3 }
HSUBPD
Horizontal-Subtract-Packed-Double [8]
  • Input: { A0, A1 }, { B0, B1 }
  • Output: { A0 − A1, B0 − B1 }
HSUBPS
Horizontal-Subtract-Packed-Single [8]
  • Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
  • Output: { A0 − A1, A2 − A3, B0 − B1, B2 − B3 }
LDDQU
As stated above, this is an alternative misaligned integer vector load. [8] It can be helpful for video compression tasks.
MOVDDUP , MOVSHDUP, MOVSLDUP [4]
These are useful for complex numbers and wave calculation like sound.
FISTTP
Like the older x87 FISTP instruction, but ignores the floating point control register's rounding mode settings and uses the "chop" (truncate) mode instead. [4] Allows omission of the expensive loading and re-loading of the control register in languages such as C where float-to-int conversion requires truncate behaviour by standard.

Other instructions

MONITOR, MWAIT
The MONITOR instruction is used to specify a memory address for monitoring, while the MWAIT instruction puts the processor into a low-power state and waits for a write event to the monitored address. [4]

Related Research Articles

In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

<span class="mw-page-title-main">Pentium 4</span> Brand by Intel

Pentium 4 is a series of single-core CPUs for desktops, laptops and entry-level servers manufactured by Intel. The processors were shipped from November 20, 2000 until August 8, 2008. It was removed from the official price lists starting in 2010, being replaced by Pentium Dual-Core.

<span class="mw-page-title-main">Athlon 64</span> Series of CPUs by AMD

The Athlon 64 is a ninth-generation, AMD64-architecture microprocessor produced by Advanced Micro Devices (AMD), released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP. The Athlon 64 was the second processor to implement the AMD64 architecture and the first 64-bit processor targeted at the average consumer. Variants of the Athlon 64 have been produced for Socket 754, Socket 939, Socket 940, and Socket AM2. It was AMD's primary consumer CPU, and primarily competed with Intel's Pentium 4, especially the Prescott and Cedar Mill core revisions.

Tejas was a code name for Intel's microprocessor, which was to be a successor to the latest Pentium 4 with the Prescott core and was sometimes referred to as Pentium V. Jayhawk was a code name for its Xeon counterpart. The cancellation of the processors in May 2004 underscored Intel's historical transition of its focus on single-core processors to multi-core processors.

3DNow! is a deprecated extension to the x86 instruction set developed by Advanced Micro Devices (AMD). It adds single instruction multiple data (SIMD) instructions to the base x86 instruction set, enabling it to perform vector processing of floating-point vector operations using vector registers. This improvement enhances the performance of many graphics-intensive applications. The first microprocessor to implement 3DNow! was the AMD K6-2, introduced in 1998. In appropriate applications, this enhancement raised the speed by about 2–4 times.

<span class="mw-page-title-main">AMD K6-III</span> Microprocessor series by AMD

The K6-III was an x86 microprocessor line manufactured by AMD that launched on February 22, 1999. The launch consisted of both 400 and 450 MHz models and was based on the preceding K6-2 architecture. Its improved 256 KB on-chip L2 cache gave it significant improvements in system performance over its predecessor the K6-2. The K6-III was the last processor officially released for desktop Socket 7 systems, however later mobile K6-III+ and K6-2+ processors could be run unofficially in certain socket 7 motherboards if an updated BIOS was made available for a given board. The Pentium III processor from Intel launched 6 days later.

SSE2 is one of the Intel SIMD processor supplementary instruction sets introduced by Intel with the initial version of the Pentium 4 in 2000. It extends the earlier SSE instruction set, and is intended to fully replace MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.

In computing, especially digital signal processing, the multiply–accumulate (MAC) or multiply-add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator ; the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator a:

In numerical analysis, Brent's method is a hybrid root-finding algorithm combining the bisection method, the secant method and inverse quadratic interpolation. It has the reliability of bisection but it can be as quick as some of the less-reliable methods. The algorithm tries to use the potentially fast-converging secant method or inverse quadratic interpolation if possible, but it falls back to the more robust bisection method if necessary. Brent's method is due to Richard Brent and builds on an earlier algorithm by Theodorus Dekker. Consequently, the method is also known as the Brent–Dekker method.

<span class="mw-page-title-main">Athlon 64 X2</span> Series of CPUs by AMD

The Athlon 64 X2 is the first native dual-core desktop central processing unit (CPU) designed by Advanced Micro Devices (AMD). It was designed from scratch as native dual-core by using an already multi-CPU enabled Athlon 64, joining it with another functional core on one die, and connecting both via a shared dual-channel memory controller/north bridge and additional control logic. The initial versions are based on the E stepping model of the Athlon 64 and, depending on the model, have either 512 or 1024 KB of L2 cache per core. The Athlon 64 X2 can decode instructions for Streaming SIMD Extensions 3 (SSE3), except those few specific to Intel's architecture. The first Athlon 64 X2 CPUs were released in May 2005, in the same month as Intel's first dual-core processor, the Pentium D.

The MixColumns operation performed by the Rijndael cipher or Advanced Encryption Standard is, along with the ShiftRows step, its primary source of diffusion. Each column of bytes is treated as a four-term polynomial , each byte representing an element in the Galois field . The coefficients are elements within the prime sub-field .

A carry-lookahead adder (CLA) or fast adder is a type of electronics adder used in digital logic. A carry-lookahead adder improves speed by reducing the amount of time required to determine carry bits. It can be contrasted with the simpler, but usually slower, ripple-carry adder (RCA), for which the carry bit is calculated alongside the sum bit, and each stage must wait until the previous carry bit has been calculated to begin calculating its own sum bit and carry bit. The carry-lookahead adder calculates one or more carry bits before the sum, which reduces the wait time to calculate the result of the larger-value bits of the adder.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :

In music, standard tuning refers to the typical tuning of a string instrument. This notion is contrary to that of scordatura, i.e. an alternate tuning designated to modify either the timbre or technical capabilities of the desired instrument.

In electronics, a Ling adder is a particularly fast binary adder designed using H. Ling's equations and generally implemented in BiCMOS. Samuel Naffziger of Hewlett-Packard presented an innovative 64 bit adder in 0.5 μm CMOS based on Ling's equations at ISSCC 1996. The Naffziger adder's delay was less than 1 nanosecond, or 7 FO4.

<span class="mw-page-title-main">Kogge–Stone adder</span> Arithmetic logic circuit

In computing, the Kogge–Stone adder is a parallel prefix form of carry-lookahead adder. Other parallel prefix adders (PPA) include the Sklansky adder (SA), Brent–Kung adder (BKA), the Han–Carlson adder (HCA), the fastest known variation, the Lynch–Swartzlander spanning tree adder (STA), Knowles adder (KNA) and Beaumont-Smith adder (BSA).

Swiftweasel was a fork of Mozilla Firefox available for the Linux platform only.

References

  1. 1 2 Wilson, Anand Lal Shimpi & Derek. "Intel's Pentium 4 E: Prescott Arrives with Luggage". www.anandtech.com. Retrieved 2023-04-10.
  2. Shimpi, Anand Lal. "Industry Update - Q4-2004: AMD adds SSE3 Support, Intel's 925/915 not selling and more". www.anandtech.com. Retrieved 2023-04-10.
  3. "Intel Instruction Set Extensions Technology". Intel. Retrieved 2023-04-10.
  4. 1 2 3 4 Wright, Christopher. "SSE3 Instruction Set". softpixel.com. Retrieved 2023-04-10.
  5. "LDDQU — Load Unaligned Integer 128 Bits". www.felixcloutier.com. Retrieved 2023-04-10.
  6. Wilson, Derek. "AMD K8 E4 Stepping: SSE3 Performance". www.anandtech.com. Retrieved 2023-04-10.
  7. "Intel Xeon 3.4GHz ['Nocona' core]". HEXUS. 2004-08-18. Retrieved 2023-04-10.
  8. 1 2 3 4 5 6 7 "SSE3 Instructions - x86 Assembly Language Reference Manual". docs.oracle.com. Retrieved 2023-04-10.