SSE4 (Streaming SIMD Extensions 4) is a SIMD CPU instruction set used in the Intel Core microarchitecture and AMD K10 (K8L). It was announced on September 27, 2006, at the Fall 2006 Intel Developer Forum, with vague details in a white paper; [1] more precise details of 47 instructions became available in a presentation at the Spring 2007 Intel Developer Forum in Beijing. [2] SSE4 extended the SSE3 instruction set, which was released in early 2004. Software using earlier Intel SIMD instructions (e.g. SSE3) is compatible with microprocessors that support SSE4: existing software continues to run correctly without modification on such processors, including alongside new applications that use SSE4. [3]
Like previous-generation CPU SIMD instruction sets, SSE4 supports up to 16 registers, each 128 bits wide, which can hold four 32-bit integers, four 32-bit single-precision floating-point numbers, or two 64-bit double-precision floating-point numbers. [1] SIMD operations, such as element-wise vector addition/multiplication and vector-scalar addition/multiplication, process multiple data elements in a single CPU instruction. This parallelism can yield noticeable performance increases. SSE4.2 introduced new SIMD string operations, including an instruction to compare two string fragments of up to 16 bytes each. [1] SSE4.2 is a subset of SSE4 and was released later than the initial SSE4 subset (SSE4.1).
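The register layout described above can be illustrated with a small scalar model. This is only a sketch in Python, not SIMD code: the helper names `lanes` and `add_packed_i32` are hypothetical, and the element-wise add imitates the behaviour of the existing packed 32-bit add (PADDD), which wraps around on overflow.

```python
import struct

def lanes(reg: bytes, fmt: str):
    """Interpret a 16-byte register image with a little-endian struct format.

    "4I" = four 32-bit integers, "4f" = four single-precision floats,
    "2d" = two double-precision floats.
    """
    assert len(reg) == 16
    return struct.unpack("<" + fmt, reg)

def add_packed_i32(a: bytes, b: bytes) -> bytes:
    """Element-wise add of four 32-bit lanes, wrapping modulo 2**32."""
    out = [(x + y) & 0xFFFFFFFF for x, y in zip(lanes(a, "4I"), lanes(b, "4I"))]
    return struct.pack("<4I", *out)

reg = struct.pack("<4I", 1, 2, 3, 0xFFFFFFFF)
one = struct.pack("<4I", 1, 1, 1, 1)

# One "instruction" updates all four lanes; the last lane wraps to 0.
print(lanes(add_packed_i32(reg, one), "4I"))        # (2, 3, 4, 0)

# The same 128 bits can also be viewed as 4 floats or 2 doubles.
print(len(lanes(reg, "4f")), len(lanes(reg, "2d")))  # 4 2
```

The point of the sketch is that the 128 bits carry no type of their own: the instruction chosen determines whether they are treated as four integers, four floats, or two doubles.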
Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in Penryn. Additionally, SSE4.2, a second subset consisting of the seven remaining instructions, is first available in Nehalem-based Core i7. Intel credits feedback from developers as playing an important role in the development of the instruction set.
Starting with Barcelona-based processors, AMD introduced the SSE4a instruction set, which has four SSE4 instructions and four new SSE instructions. These instructions are not found in Intel's processors supporting SSE4.1, and AMD processors only started supporting Intel's SSE4.1 and SSE4.2 (the full SSE4 instruction set) in the Bulldozer-based FX processors. SSE4a also introduced the misaligned SSE feature, which made unaligned load instructions as fast as aligned versions on aligned addresses. It also allowed disabling the alignment check on non-load SSE operations accessing memory. [4] Intel later introduced similar speed improvements to unaligned SSE in its Nehalem processors, but did not introduce misaligned access by non-load SSE instructions until AVX. [5]
What is now known as SSSE3 (Supplemental Streaming SIMD Extensions 3), introduced in the Intel Core 2 processor line, was referred to as SSE4 by some media until Intel came up with the SSSE3 moniker. Internally dubbed Merom New Instructions, Intel originally did not plan to assign a special name to them, which was criticized by some journalists. [6] Intel eventually cleared up the confusion and reserved the SSE4 name for their next instruction set extension. [7]
Intel is using the marketing term HD Boost to refer to SSE4. [8]
Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.
Several of these instructions are enabled by the single-cycle shuffle engine in Penryn. (Shuffle operations reorder bytes within a register.)
These instructions were introduced with Penryn microarchitecture, the 45 nm shrink of Intel's Core microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE41[Bit 19] flag.
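As a rough illustration of how such a CPUID flag is tested, the following Python sketch decodes a value of ECX as returned by CPUID leaf 01H. It does not execute CPUID itself; the function name `decode_leaf1_ecx` and the example value are made up for illustration, and the bit positions are the ones listed in this article.

```python
# Bit positions in CPUID.01H:ECX as given in this article.
ECX_BITS = {"SSE4.1": 19, "SSE4.2": 20, "POPCNT": 23}

def decode_leaf1_ecx(ecx: int) -> dict:
    """Return which of the listed features a given ECX value advertises."""
    return {name: bool((ecx >> bit) & 1) for name, bit in ECX_BITS.items()}

# Hypothetical ECX value with bits 19 and 23 set and bit 20 clear.
example_ecx = (1 << 19) | (1 << 23)
print(decode_leaf1_ecx(example_ecx))
# {'SSE4.1': True, 'SSE4.2': False, 'POPCNT': True}
```

Each feature has its own bit, so software should test the specific flag for the instructions it intends to use rather than infer support from another flag.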
Instruction | Description |
---|---|
MPSADBW | Compute eight offset sums of absolute differences, four at a time (i.e., |x0−y0|+|x1−y1|+|x2−y2|+|x3−y3|, |x0−y1|+|x1−y2|+|x2−y3|+|x3−y4|, ..., |x0−y7|+|x1−y8|+|x2−y9|+|x3−y10|); this operation is important for some HD codecs, and allows an 8×8 block difference to be computed in fewer than seven cycles. [9] One bit of a three-bit immediate operand indicates whether y0 .. y10 or y4 .. y14 should be used from the destination operand, the other two whether x0..x3, x4..x7, x8..x11 or x12..x15 should be used from the source. |
PHMINPOSUW | Sets the bottom unsigned 16-bit word of the destination to the smallest unsigned 16-bit word in the source, and the next-from-bottom to the index of that word in the source. |
PMULDQ | Packed 32-bit signed "long" multiplication, two (1st and 3rd) out of four packed integers multiplied giving two packed 64-bit results. |
PMULLD | Packed 32-bit signed "low" multiplication, four packed sets of integers multiplied giving four packed 32-bit results. |
DPPS , DPPD | Dot product for AOS (Array of Structs) data. This takes an immediate operand consisting of four (or two for DPPD) bits to select which of the entries in the input to multiply and accumulate, and another four (or two for DPPD) to select whether to put 0 or the dot-product in the appropriate field of the output. |
BLENDPS , BLENDPD , BLENDVPS , BLENDVPD , PBLENDVB , PBLENDW | Conditional copying of elements in one location with another, based (for non-V form) on the bits in an immediate operand, and (for V form) on the bits in register XMM0. |
PMINSB , PMAXSB , PMINUW , PMAXUW , PMINUD , PMAXUD , PMINSD , PMAXSD | Packed minimum/maximum for different integer operand types |
ROUNDPS , ROUNDSS , ROUNDPD , ROUNDSD | Round values in a floating-point register to integers, using one of four rounding modes specified by an immediate operand |
INSERTPS , PINSRB , PINSRD /PINSRQ , EXTRACTPS , PEXTRB , PEXTRD/PEXTRQ | The INSERTPS and PINSR instructions read 8, 16 or 32 bits from an x86 register or memory location and insert them into a field in the destination register given by an immediate operand. EXTRACTPS and PEXTR read a field from the source register and insert it into an x86 register or memory location. For example, PEXTRD eax, xmm0, 1; EXTRACTPS [addr+4*eax], xmm1, 1 stores field 1 of xmm1 (counting from zero) at the address given by field 1 of xmm0. |
PMOVSXBW , PMOVZXBW , PMOVSXBD , PMOVZXBD , PMOVSXBQ , PMOVZXBQ , PMOVSXWD , PMOVZXWD , PMOVSXWQ , PMOVZXWQ , PMOVSXDQ , PMOVZXDQ | Packed sign/zero extension to wider types |
PTEST | This is similar to the TEST instruction, in that it sets the Z flag to the result of an AND between its operands: ZF is set, if DEST AND SRC is equal to 0. Additionally it sets the C flag if (NOT DEST) AND SRC equals zero. This is equivalent to setting the Z flag if none of the bits masked by SRC are set, and the C flag if all of the bits masked by SRC are set. |
PCMPEQQ | Quadword (64 bits) compare for equality |
PACKUSDW | Convert signed DWORDs into unsigned WORDs with saturation. |
MOVNTDQA | Efficient read from write-combining memory area into SSE register; this is useful for retrieving results from peripherals attached to the memory bus. |
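
The MPSADBW and PHMINPOSUW rows above can be read together: the first produces eight sums of absolute differences and the second picks the smallest one and its position. The following Python sketch is a scalar reference model of that description, not the hardware implementation; the function names mirror the mnemonics, and the input data is invented for the example.

```python
def mpsadbw(dest: bytes, src: bytes, imm: int) -> list:
    """Eight offset sums of absolute differences over 4-byte blocks.

    Bit 2 of imm selects y0..y10 or y4..y14 from dest;
    bits 1:0 select x0..x3, x4..x7, x8..x11 or x12..x15 from src.
    """
    assert len(dest) == 16 and len(src) == 16
    y_off = 4 * ((imm >> 2) & 1)
    x_off = 4 * (imm & 3)
    x = src[x_off:x_off + 4]
    return [sum(abs(x[j] - dest[y_off + i + j]) for j in range(4))
            for i in range(8)]

def phminposuw(words: list) -> tuple:
    """Smallest unsigned 16-bit word and the index of its first occurrence."""
    smallest = min(words)
    return smallest, words.index(smallest)

dest = bytes(range(16))                       # y0..y15 = 0..15
src = bytes([10, 11, 12, 13] + [0] * 12)      # x0..x3 = 10..13

sads = mpsadbw(dest, src, 0b100)              # compare x0..x3 against y4..y14
print(sads)                                   # [24, 20, 16, 12, 8, 4, 0, 4]
print(phminposuw(sads))                       # (0, 6): exact match at offset 6
```

In this example the block 10..13 occurs in the destination at byte 10, which is offset 6 from y4, so the minimum sum is 0 at index 6.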
SSE4.2 added STTNI (String and Text New Instructions), [10] several new instructions that perform character searches and comparison on two operands of 16 bytes at a time. These were designed (among other things) to speed up the parsing of XML documents. [11] It also added a CRC32 instruction to compute cyclic redundancy checks as used in certain data transfer protocols. These instructions were first implemented in the Nehalem-based Intel Core i7 product line, and complete the SSE4 instruction set. AMD, on the other hand, first added support starting with the Bulldozer microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE42[Bit 20] flag.
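To give a feel for the kind of character search these instructions perform, the sketch below models one configuration only: an implicit-length compare on unsigned bytes that returns the index of the first byte of the text matching any byte of a character set, or 16 if there is none. It is a simplified scalar model in Python written for this article; the helper names (`pad16`, `implicit_len`, `pcmpistri_equal_any`) are hypothetical, and the other modes selected by the immediate operand are not covered.

```python
def pad16(s: bytes) -> bytes:
    """Pad a short byte string with NUL bytes to a 16-byte operand."""
    assert len(s) <= 16
    return s.ljust(16, b"\0")

def implicit_len(block: bytes) -> int:
    """Implicit length: bytes before the first NUL, or 16 if there is none."""
    n = block.find(0)
    return 16 if n < 0 else n

def pcmpistri_equal_any(charset: bytes, text: bytes) -> int:
    """Index of the first byte in text that equals any byte in charset."""
    assert len(charset) == 16 and len(text) == 16
    wanted = set(charset[:implicit_len(charset)])
    for i in range(implicit_len(text)):
        if text[i] in wanted:
            return i
    return 16  # no match within this 16-byte block

# Looking for markup characters, as an XML parser might.
markup = pad16(b"<&")
print(pcmpistri_equal_any(markup, pad16(b"hello <b>world")))  # 6
print(pcmpistri_equal_any(markup, pad16(b"plain text")))      # 16
```

A return value of 16 tells the caller to advance to the next 16-byte block, which is how a loop over a longer string would typically be structured.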
Windows 11 24H2 requires the CPU to support SSE4.2, otherwise the Windows kernel is unbootable. [12]
Instruction | Description |
---|---|
CRC32 | Accumulate CRC32C value using the polynomial 0x11EDC6F41 (or, without the high order bit, 0x1EDC6F41). [13] [14] |
PCMPESTRI | Packed Compare Explicit Length Strings, Return Index |
PCMPESTRM | Packed Compare Explicit Length Strings, Return Mask |
PCMPISTRI | Packed Compare Implicit Length Strings, Return Index |
PCMPISTRM | Packed Compare Implicit Length Strings, Return Mask |
PCMPGTQ | Compare Packed Signed 64-bit data For Greater Than |
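
The accumulation performed by CRC32 can be modelled bit by bit. The Python sketch below is a reference model under stated assumptions, not a description of the hardware: it uses the bit-reflected form of the polynomial 0x1EDC6F41 (the constant 0x82F63B78), and the initial value and final inversion in `crc32c` follow the usual CRC-32C convention rather than anything done by the instruction itself, which only accumulates.

```python
POLY_REFLECTED = 0x82F63B78  # bit-reversed form of 0x1EDC6F41

def crc32c_update(crc: int, byte: int) -> int:
    """Fold one byte into a running 32-bit CRC (one accumulate step)."""
    crc ^= byte
    for _ in range(8):
        # Shift out one bit; apply the polynomial if that bit was set.
        crc = (crc >> 1) ^ (POLY_REFLECTED if crc & 1 else 0)
    return crc

def crc32c(data: bytes) -> int:
    """CRC-32C with the conventional all-ones start and final inversion."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = crc32c_update(crc, b)
    return crc ^ 0xFFFFFFFF

print(hex(crc32c(b"123456789")))  # 0xe3069283, the standard check value
```

Software that cannot rely on SSE4.2 typically falls back to a table-driven version of the same update step.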
POPCNT and LZCNT
These instructions operate on integer rather than SSE registers because they are not SIMD instructions; they merely appeared at the same time. Although AMD introduced them with the SSE4a instruction set, they are counted as separate extensions with their own dedicated CPUID bits to indicate support. Intel implements POPCNT beginning with the Nehalem microarchitecture and LZCNT beginning with the Haswell microarchitecture. AMD implements both, beginning with the Barcelona microarchitecture.
AMD calls this pair of instructions Advanced Bit Manipulation (ABM).
The encoding of LZCNT takes the same encoding path as the encoding of the BSR (bit scan reverse) instruction. As a result, LZCNT executed on some CPUs that do not support it, such as Intel CPUs prior to Haswell, may silently execute the BSR operation instead of raising an invalid-instruction exception. This is a problem because LZCNT and BSR return different result values.
Trailing zeros can be counted using the BSF (bit scan forward) or TZCNT instructions.
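The difference between the LZCNT and BSR results is easy to see in a scalar model. The Python functions below are a sketch for 32-bit operands; their names mirror the mnemonics, and returning `None` from `bsr` for a zero input is a modelling choice standing in for the undefined destination value of the real instruction.

```python
MASK32 = 0xFFFFFFFF

def popcnt(x: int) -> int:
    """Number of bits set to 1."""
    return bin(x & MASK32).count("1")

def lzcnt(x: int) -> int:
    """Number of leading zero bits; 32 for a zero input."""
    return 32 - (x & MASK32).bit_length()

def bsr(x: int):
    """Index of the highest set bit; undefined (None here) for zero."""
    x &= MASK32
    return None if x == 0 else x.bit_length() - 1

def tzcnt(x: int) -> int:
    """Number of trailing zero bits; 32 for a zero input."""
    x &= MASK32
    return 32 if x == 0 else (x & -x).bit_length() - 1

x = 0x00010000                      # only bit 16 set
print(lzcnt(x), bsr(x))             # 15 16
print(tzcnt(x), popcnt(0xF0F0))     # 16 8
```

For a nonzero 32-bit input the two results are related by `lzcnt(x) == 31 - bsr(x)`, so code that silently ran BSR in place of LZCNT would compute a wrong value rather than fail.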
Windows 11 24H2 requires the CPU to support POPCNT, otherwise the Windows kernel is unbootable. [15]
Instruction | Description |
---|---|
POPCNT | Population count (count number of bits set to 1). Support is indicated via the CPUID.01H:ECX.POPCNT[Bit 23] flag. [16] |
LZCNT | Leading zero count. Support is indicated via the CPUID.80000001H:ECX.ABM[Bit 5] flag. [17] |
The SSE4a instruction group was introduced in AMD's Barcelona microarchitecture. These instructions are not available in Intel processors. Support is indicated via the CPUID.80000001H:ECX.SSE4A[Bit 6] flag. [17]
Instruction | Description |
---|---|
EXTRQ /INSERTQ | Combined mask-shift instructions. [18] |
MOVNTSD /MOVNTSS | Scalar streaming store instructions. [19] |