SSE3

Last updated September 21, 2024

SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI),^[1] is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU.^[1] In April 2005, AMD introduced a subset of SSE3 in revision E (Venice and San Diego) of their Athlon 64 CPUs.^[2] The earlier SIMD instruction sets on the x86 platform, from oldest to newest, are MMX, 3DNow! (developed by AMD, no longer supported on newer CPUs), SSE, and SSE2.

Changes

The most notable change is the capability to work horizontally in a register, as opposed to the more or less strictly vertical operation of all previous SSE instructions. More specifically, instructions to add and subtract the multiple values stored within a single register have been added.^[4] These instructions can be used to speed up the implementation of a number of DSP and 3D operations. There is also a new instruction to convert floating point values to integers without having to change the global rounding mode, thus avoiding costly pipeline stalls. Finally, the extension adds LDDQU, an alternative misaligned integer vector load that has better performance on NetBurst based platforms for loads that cross cacheline boundaries.^[5]

CPUs with SSE3

AMD:
- Opteron (since Stepping E4^[6])
- Sempron (since Palermo. Stepping E3)
- Athlon 64 (since Venice Stepping E3 and San Diego Stepping E4)
- Athlon 64 FX (since San Diego Stepping E4)
- Athlon 64 X2
- Phenom 64 X2
- Turion family
- K10 family
- APU family (including without GPU)
- FX Series
- Zen family
Intel:
- Celeron D
- Celeron (starting with Core microarchitecture)
- Pentium 4 (since Prescott)
- Pentium D
- Pentium Extreme Edition (but NOT Pentium 4 Extreme Edition)
- Pentium Dual-Core
- Pentium (starting with Core microarchitecture)
- Core
- Xeon (since Nocona^[7])
- Atom
VIA/Centaur:
- C7
- Nano
Transmeta Efficeon TM88xx (NOT Model Numbers TM86xx)

New instructions

Common instructions

Arithmetic

ADDSUBPD

Add-Subtract-Packed-Double^[8]

Input: { A0, A1 }, { B0, B1 }
Output: { A0 − B0, A1 + B1 }

ADDSUBPS

Add-Subtract-Packed-Single^[8]

Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
Output: { A0 − B0, A1 + B1, A2 − B2, A3 + B3 }

AOS ( Array Of Structures )

HADDPD

Horizontal-Add-Packed-Double^[8]

Input: { A0, A1 }, { B0, B1 }
Output: { A0 + A1, B0 + B1 }

HADDPS

Horizontal-Add-Packed-Single^[8]

Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
Output: { A0 + A1, A2 + A3, B0 + B1, B2 + B3 }

HSUBPD

Horizontal-Subtract-Packed-Double^[8]

Input: { A0, A1 }, { B0, B1 }
Output: { A0 − A1, B0 − B1 }

HSUBPS

Horizontal-Subtract-Packed-Single^[8]

Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
Output: { A0 − A1, A2 − A3, B0 − B1, B2 − B3 }

LDDQU

As stated above, this is an alternative misaligned integer vector load.^[8] It can be helpful for video compression tasks.

MOVDDUP , MOVSHDUP, MOVSLDUP^[4]

These are useful for complex numbers and wave calculation like sound.

FISTTP

Like the older x87 FISTP instruction, but ignores the floating point control register's rounding mode settings and uses the "chop" (truncate) mode instead.^[4] Allows omission of the expensive loading and re-loading of the control register in languages such as C where float-to-int conversion requires truncate behaviour by standard.

Other instructions

MONITOR, MWAIT: The MONITOR instruction is used to specify a memory address for monitoring, while the MWAIT instruction puts the processor into a low-power state and waits for a write event to the monitored address.^[4]

Related Research Articles

x86 is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the 8086 microprocessor and its 8-bit-external-bus variant, the 8088. The 8086 was introduced in 1978 as a fully 16-bit extension of 8-bit Intel's 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486. Colloquially, their names were "186", "286", "386" and "486".

In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

Pentium 4 is a series of single-core CPUs for desktops, laptops and entry-level servers manufactured by Intel. The processors were shipped from November 20, 2000 until August 8, 2008. It was removed from the official price lists starting in 2010, being replaced by Pentium Dual-Core.

The Athlon 64 is a ninth-generation, AMD64-architecture microprocessor produced by Advanced Micro Devices (AMD), released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP. The Athlon 64 was the second processor to implement the AMD64 architecture and the first 64-bit processor targeted at the average consumer. Variants of the Athlon 64 have been produced for Socket 754, Socket 939, Socket 940, and Socket AM2. It was AMD's primary consumer CPU, and primarily competed with Intel's Pentium 4, especially the Prescott and Cedar Mill core revisions.

Tejas was a code name for Intel's microprocessor, which was to be a successor to the latest Pentium 4 with the Prescott core and was sometimes referred to as Pentium V. Jayhawk was a code name for its Xeon counterpart. The cancellation of the processors in May 2004 underscored Intel's historical transition of its focus on single-core processors to multi-core processors.

3DNow! is a deprecated extension to the x86 instruction set developed by Advanced Micro Devices (AMD). It adds single instruction multiple data (SIMD) instructions to the base x86 instruction set, enabling it to perform vector processing of floating-point vector operations using vector registers. This improvement enhances the performance of many graphics-intensive applications. The first microprocessor to implement 3DNow! was the AMD K6-2, introduced in 1998. In appropriate applications, this enhancement raised the speed by about 2–4 times.

<span class="mw-page-title-main">AMD K6-III</span> Microprocessor series by AMD

The K6-III was an x86 microprocessor line manufactured by AMD that launched on February 22, 1999. The launch consisted of both 400 and 450 MHz models and was based on the preceding K6-2 architecture. Its improved 256 KB on-chip L2 cache gave it significant improvements in system performance over its predecessor the K6-2. The K6-III was the last processor officially released for desktop Socket 7 systems, however later mobile K6-III+ and K6-2+ processors could be run unofficially in certain socket 7 motherboards if an updated BIOS was made available for a given board. The Pentium III processor from Intel launched 6 days later.

SSE2 is one of the Intel SIMD processor supplementary instruction sets introduced by Intel with the initial version of the Pentium 4 in 2000. SSE2 instructions allow the use of XMM (SIMD) registers on x86 instruction set architecture processors. These registers can load up to 128 bits of data and perform instructions, such as vector addition and multiplication, simultaneously.

In computing, especially digital signal processing, the multiply–accumulate (MAC) or multiply-add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator ; the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator a:

The x86 instruction set refers to the set of instructions that x86-compatible microprocessors support. The instructions are usually part of an executable program, often stored as a computer file and executed on the processor.

In numerical analysis, Brent's method is a hybrid root-finding algorithm combining the bisection method, the secant method and inverse quadratic interpolation. It has the reliability of bisection but it can be as quick as some of the less-reliable methods. The algorithm tries to use the potentially fast-converging secant method or inverse quadratic interpolation if possible, but it falls back to the more robust bisection method if necessary. Brent's method is due to Richard Brent and builds on an earlier algorithm by Theodorus Dekker. Consequently, the method is also known as the Brent–Dekker method.

The Athlon 64 X2 is the first native dual-core desktop central processing unit (CPU) designed by Advanced Micro Devices (AMD). It was designed from scratch as native dual-core by using an already multi-CPU enabled Athlon 64, joining it with another functional core on one die, and connecting both via a shared dual-channel memory controller/north bridge and additional control logic. The initial versions are based on the E stepping model of the Athlon 64 and, depending on the model, have either 512 or 1024 KB of L2 cache per core. The Athlon 64 X2 can decode instructions for Streaming SIMD Extensions 3 (SSE3), except those few specific to Intel's architecture. The first Athlon 64 X2 CPUs were released in May 2005, in the same month as Intel's first dual-core processor, the Pentium D.

A carry-lookahead adder (CLA) or fast adder is a type of electronics adder used in digital logic. A carry-lookahead adder improves speed by reducing the amount of time required to determine carry bits. It can be contrasted with the simpler, but usually slower, ripple-carry adder (RCA), for which the carry bit is calculated alongside the sum bit, and each stage must wait until the previous carry bit has been calculated to begin calculating its own sum bit and carry bit. The carry-lookahead adder calculates one or more carry bits before the sum, which reduces the wait time to calculate the result of the larger-value bits of the adder.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :

In music, standard tuning refers to the typical tuning of a string instrument. This notion is contrary to that of scordatura, i.e. an alternate tuning designated to modify either the timbre or technical capabilities of the desired instrument.

In electronics, a Ling adder is a particularly fast binary adder designed using H. Ling's equations and generally implemented in BiCMOS. Samuel Naffziger of Hewlett-Packard presented an innovative 64 bit adder in 0.5 μm CMOS based on Ling's equations at ISSCC 1996. The Naffziger adder's delay was less than 1 nanosecond, or 7 FO4.

Supplemental Streaming SIMD Extensions 3 is a SIMD instruction set created by Intel and is the fourth iteration of the SSE technology.

In computing, the Kogge–Stone adder is a parallel prefix form of carry-lookahead adder. Other parallel prefix adders (PPA) include the Sklansky adder (SA), Brent–Kung adder (BKA), the Han–Carlson adder (HCA), the fastest known variation, the Lynch–Swartzlander spanning tree adder (STA), Knowles adder (KNA) and Beaumont-Smith adder (BSA).

Swiftweasel was a fork of Mozilla Firefox available for the Linux platform only.

References

1 2 Shimpi, Anand Lal; Wilson, Derek. "Intel's Pentium 4 E: Prescott Arrives with Luggage". www.anandtech.com. Retrieved 2023-04-10.
↑ Shimpi, Anand Lal. "Industry Update - Q4-2004: AMD adds SSE3 Support, Intel's 925/915 not selling and more". www.anandtech.com. Retrieved 2023-04-10.
↑ "Intel Instruction Set Extensions Technology". Intel. Retrieved 2023-04-10.
1 2 3 4 Wright, Christopher. "SSE3 Instruction Set". softpixel.com. Retrieved 2023-04-10.
↑ "LDDQU — Load Unaligned Integer 128 Bits". www.felixcloutier.com. Retrieved 2023-04-10.
↑ Wilson, Derek. "AMD K8 E4 Stepping: SSE3 Performance". www.anandtech.com. Retrieved 2023-04-10.
↑ "Intel Xeon 3.4GHz ['Nocona' core]". HEXUS. 2004-08-18. Retrieved 2023-04-10.
1 2 3 4 5 6 7 "SSE3 Instructions - x86 Assembly Language Reference Manual". docs.oracle.com. Retrieved 2023-04-10.

External links

X-bit Labs

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:1-1] 1 2 Shimpi, Anand Lal; Wilson, Derek. "Intel's Pentium 4 E: Prescott Arrives with Luggage". www.anandtech.com. Retrieved 2023-04-10.

[2] Shimpi, Anand Lal. "Industry Update - Q4-2004: AMD adds SSE3 Support, Intel's 925/915 not selling and more". www.anandtech.com. Retrieved 2023-04-10.

[3] "Intel Instruction Set Extensions Technology". Intel. Retrieved 2023-04-10.

[:2-4] 1 2 3 4 Wright, Christopher. "SSE3 Instruction Set". softpixel.com. Retrieved 2023-04-10.

[5] "LDDQU — Load Unaligned Integer 128 Bits". www.felixcloutier.com. Retrieved 2023-04-10.

[6] Wilson, Derek. "AMD K8 E4 Stepping: SSE3 Performance". www.anandtech.com. Retrieved 2023-04-10.

[7] "Intel Xeon 3.4GHz ['Nocona' core]". HEXUS. 2004-08-18. Retrieved 2023-04-10.

[:0-8] 1 2 3 4 5 6 7 "SSE3 Instructions - x86 Assembly Language Reference Manual". docs.oracle.com. Retrieved 2023-04-10.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

v t e Instruction set extensions
SIMD (RISC)	Alpha MVI ARM NEON SVE MIPS MDMX MIPS-3D MXU MIPS SIMD PA-RISC MAX Power ISA VMX SPARC VIS
SIMD (x86)	MMX (1996) 3DNow! (1998) SSE (1999) SSE2 (2001) SSE3 (2004) SSSE3 (2006) SSE4 (2006) SSE5 ~~(2007)~~ AVX (2008) F16C (2009) XOP (2009) FMA (FMA4: 2011, FMA3: 2012) AVX2 (2013) AVX-512 (2015) AMX (2022) AVX10 (2023)
Bit manipulation	BMI (ABM: 2007, BMI1: 2012, BMI2: 2013, TBM: 2012) ADX (2014)
Compressed instructions	Thumb MIPS16e ASE RVC
Security and cryptography	PadLock (2003) AES-NI (2008); ARMv8 also has AES instructions CLMUL (2010) RDRAND (2012) SHA (2013) MPX (2015) SGX (2015) TDX (2021)
Transactional memory	TSX (2013) ASF
Virtualization	VT-x (2005) AMD-V (2006) VT-d (AMD-Vi)
Suspended extensions' dates are ~~struck through~~.