X86 SIMD instruction listings

The x86 instruction set has been extended several times with SIMD (single instruction, multiple data) instruction set extensions. These extensions, starting with the MMX instruction set introduced with the Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide those registers into fixed-size lanes and perform a computation on each lane in parallel.
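As a minimal illustration of this lane-wise model, the following C sketch (assuming a compiler with SSE2 intrinsics support, e.g. gcc -msse2; the names are illustrative only) performs sixteen byte additions with a single PADDB instruction:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    __m128i a = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8,
                              9, 10, 11, 12, 13, 14, 15, 16);
    __m128i b = _mm_set1_epi8(10);
    /* one PADDB: sixteen independent byte additions in parallel */
    __m128i sum = _mm_add_epi8(a, b);

    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, sum);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);   /* prints 11 12 ... 26 */
    printf("\n");
    return 0;
}
```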

MMX instructions

MMX instructions operate on the mm registers, which are 64 bits wide. They are shared with the FPU registers.

Original MMX instructions

Added with Pentium MMX

Instruction | Opcode | Meaning
EMMS | 0F 77 | Empty MMX Technology State (marks all x87 FPU registers as available to the FPU)
MOVD mm, r/m32 | 0F 6E /r | Move doubleword
MOVD r/m32, mm | 0F 7E /r | Move doubleword
MOVQ mm/m64, mm | 0F 7F /r | Move quadword
MOVQ mm, mm/m64 | 0F 6F /r | Move quadword
MOVQ mm, r/m64 | REX.W + 0F 6E /r | Move quadword
MOVQ r/m64, mm | REX.W + 0F 7E /r | Move quadword
PACKSSDW mm1, mm2/m64 | 0F 6B /r | Pack doublewords to words (signed with saturation)
PACKSSWB mm1, mm2/m64 | 0F 63 /r | Pack words to bytes (signed with saturation)
PACKUSWB mm, mm/m64 | 0F 67 /r | Pack words to bytes (unsigned with saturation)
PADDB mm, mm/m64 | 0F FC /r | Add packed byte integers
PADDW mm, mm/m64 | 0F FD /r | Add packed word integers
PADDD mm, mm/m64 | 0F FE /r | Add packed doubleword integers
PADDSB mm, mm/m64 | 0F EC /r | Add packed signed byte integers with saturation
PADDSW mm, mm/m64 | 0F ED /r | Add packed signed word integers with saturation
PADDUSB mm, mm/m64 | 0F DC /r | Add packed unsigned byte integers with saturation
PADDUSW mm, mm/m64 | 0F DD /r | Add packed unsigned word integers with saturation
PAND mm, mm/m64 | 0F DB /r | Bitwise AND
PANDN mm, mm/m64 | 0F DF /r | Bitwise AND NOT
POR mm, mm/m64 | 0F EB /r | Bitwise OR
PXOR mm, mm/m64 | 0F EF /r | Bitwise XOR
PCMPEQB mm, mm/m64 | 0F 74 /r | Compare packed bytes for equality
PCMPEQW mm, mm/m64 | 0F 75 /r | Compare packed words for equality
PCMPEQD mm, mm/m64 | 0F 76 /r | Compare packed doublewords for equality
PCMPGTB mm, mm/m64 | 0F 64 /r | Compare packed signed byte integers for greater than
PCMPGTW mm, mm/m64 | 0F 65 /r | Compare packed signed word integers for greater than
PCMPGTD mm, mm/m64 | 0F 66 /r | Compare packed signed doubleword integers for greater than
PMADDWD mm, mm/m64 | 0F F5 /r | Multiply packed words, add adjacent doubleword results
PMULHW mm, mm/m64 | 0F E5 /r | Multiply packed signed word integers, store high 16 bits of results
PMULLW mm, mm/m64 | 0F D5 /r | Multiply packed signed word integers, store low 16 bits of results
PSLLW mm, imm8 | 0F 71 /6 ib | Shift left words, shift in zeros
PSLLW mm, mm/m64 | 0F F1 /r | Shift left words, shift in zeros
PSLLD mm, imm8 | 0F 72 /6 ib | Shift left doublewords, shift in zeros
PSLLD mm, mm/m64 | 0F F2 /r | Shift left doublewords, shift in zeros
PSLLQ mm, imm8 | 0F 73 /6 ib | Shift left quadword, shift in zeros
PSLLQ mm, mm/m64 | 0F F3 /r | Shift left quadword, shift in zeros
PSRAD mm, imm8 | 0F 72 /4 ib | Shift right doublewords, shift in sign bits
PSRAD mm, mm/m64 | 0F E2 /r | Shift right doublewords, shift in sign bits
PSRAW mm, imm8 | 0F 71 /4 ib | Shift right words, shift in sign bits
PSRAW mm, mm/m64 | 0F E1 /r | Shift right words, shift in sign bits
PSRLW mm, imm8 | 0F 71 /2 ib | Shift right words, shift in zeros
PSRLW mm, mm/m64 | 0F D1 /r | Shift right words, shift in zeros
PSRLD mm, imm8 | 0F 72 /2 ib | Shift right doublewords, shift in zeros
PSRLD mm, mm/m64 | 0F D2 /r | Shift right doublewords, shift in zeros
PSRLQ mm, imm8 | 0F 73 /2 ib | Shift right quadword, shift in zeros
PSRLQ mm, mm/m64 | 0F D3 /r | Shift right quadword, shift in zeros
PSUBB mm, mm/m64 | 0F F8 /r | Subtract packed byte integers
PSUBW mm, mm/m64 | 0F F9 /r | Subtract packed word integers
PSUBD mm, mm/m64 | 0F FA /r | Subtract packed doubleword integers
PSUBSB mm, mm/m64 | 0F E8 /r | Subtract packed signed bytes with saturation
PSUBSW mm, mm/m64 | 0F E9 /r | Subtract packed signed words with saturation
PSUBUSB mm, mm/m64 | 0F D8 /r | Subtract packed unsigned bytes with saturation
PSUBUSW mm, mm/m64 | 0F D9 /r | Subtract packed unsigned words with saturation
PUNPCKHBW mm, mm/m64 | 0F 68 /r | Unpack and interleave high-order bytes
PUNPCKHWD mm, mm/m64 | 0F 69 /r | Unpack and interleave high-order words
PUNPCKHDQ mm, mm/m64 | 0F 6A /r | Unpack and interleave high-order doublewords
PUNPCKLBW mm, mm/m32 | 0F 60 /r | Unpack and interleave low-order bytes
PUNPCKLWD mm, mm/m32 | 0F 61 /r | Unpack and interleave low-order words
PUNPCKLDQ mm, mm/m32 | 0F 62 /r | Unpack and interleave low-order doublewords

MMX instructions added in specific processors

MMX instructions added with MMX+ and SSE

The following MMX instructions were added with SSE. They are also available on the Athlon under the name MMX+.

Instruction | Opcode | Meaning
MASKMOVQ mm1, mm2 | 0F F7 /r | Masked Move of Quadword
MOVNTQ m64, mm | 0F E7 /r | Move Quadword Using Non-Temporal Hint
PSHUFW mm1, mm2/m64, imm8 | 0F 70 /r ib | Shuffle Packed Words
PINSRW mm, r32/m16, imm8 | 0F C4 /r ib | Insert Word
PEXTRW reg, mm, imm8 | 0F C5 /r ib | Extract Word
PMOVMSKB reg, mm | 0F D7 /r | Move Byte Mask
PMINUB mm1, mm2/m64 | 0F DA /r | Minimum of Packed Unsigned Byte Integers
PMAXUB mm1, mm2/m64 | 0F DE /r | Maximum of Packed Unsigned Byte Integers
PAVGB mm1, mm2/m64 | 0F E0 /r | Average Packed Unsigned Byte Integers
PAVGW mm1, mm2/m64 | 0F E3 /r | Average Packed Unsigned Word Integers
PMULHUW mm1, mm2/m64 | 0F E4 /r | Multiply Packed Unsigned Integers and Store High Result
PMINSW mm1, mm2/m64 | 0F EA /r | Minimum of Packed Signed Word Integers
PMAXSW mm1, mm2/m64 | 0F EE /r | Maximum of Packed Signed Word Integers
PSADBW mm1, mm2/m64 | 0F F6 /r | Compute Sum of Absolute Differences

MMX instructions added with SSE2

The following MMX instructions were added with SSE2:

Instruction | Opcode | Meaning
PADDQ mm, mm/m64 | 0F D4 /r | Add packed quadword integers
PSUBQ mm1, mm2/m64 | 0F FB /r | Subtract packed quadword integers
PMULUDQ mm1, mm2/m64 | 0F F4 /r | Multiply unsigned doubleword integers

MMX instructions added with SSSE3

Instruction | Opcode | Meaning
PSIGNB mm1, mm2/m64 | 0F 38 08 /r | Negate/zero/preserve packed byte integers depending on corresponding sign
PSIGNW mm1, mm2/m64 | 0F 38 09 /r | Negate/zero/preserve packed word integers depending on corresponding sign
PSIGND mm1, mm2/m64 | 0F 38 0A /r | Negate/zero/preserve packed doubleword integers depending on corresponding sign
PSHUFB mm1, mm2/m64 | 0F 38 00 /r | Shuffle bytes
PMULHRSW mm1, mm2/m64 | 0F 38 0B /r | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits
PMADDUBSW mm1, mm2/m64 | 0F 38 04 /r | Multiply signed and unsigned bytes, add horizontal pairs of signed words, pack saturated signed words
PHSUBW mm1, mm2/m64 | 0F 38 05 /r | Subtract and pack 16-bit signed integers horizontally
PHSUBSW mm1, mm2/m64 | 0F 38 07 /r | Subtract and pack 16-bit signed integers horizontally with saturation
PHSUBD mm1, mm2/m64 | 0F 38 06 /r | Subtract and pack 32-bit signed integers horizontally
PHADDSW mm1, mm2/m64 | 0F 38 03 /r | Add and pack 16-bit signed integers horizontally with saturation
PHADDW mm1, mm2/m64 | 0F 38 01 /r | Add and pack 16-bit integers horizontally
PHADDD mm1, mm2/m64 | 0F 38 02 /r | Add and pack 32-bit integers horizontally
PALIGNR mm1, mm2/m64, imm8 | 0F 3A 0F /r ib | Concatenate destination and source operands, extract a byte-aligned result shifted to the right
PABSB mm1, mm2/m64 | 0F 38 1C /r | Compute the absolute value of bytes and store the unsigned result
PABSW mm1, mm2/m64 | 0F 38 1D /r | Compute the absolute value of 16-bit integers and store the unsigned result
PABSD mm1, mm2/m64 | 0F 38 1E /r | Compute the absolute value of 32-bit integers and store the unsigned result

SSE instructions

Added with Pentium III

SSE instructions operate on xmm registers, which are 128 bits wide.

SSE added the following SIMD floating-point instructions:

Instruction | Opcode | Meaning
ANDPS* xmm1, xmm2/m128 | 0F 54 /r | Bitwise Logical AND of Packed Single-Precision Floating-Point Values
ANDNPS* xmm1, xmm2/m128 | 0F 55 /r | Bitwise Logical AND NOT of Packed Single-Precision Floating-Point Values
ORPS* xmm1, xmm2/m128 | 0F 56 /r | Bitwise Logical OR of Packed Single-Precision Floating-Point Values
XORPS* xmm1, xmm2/m128 | 0F 57 /r | Bitwise Logical XOR of Packed Single-Precision Floating-Point Values
MOVUPS xmm1, xmm2/m128 | 0F 10 /r | Move Unaligned Packed Single-Precision Floating-Point Values
MOVSS xmm1, xmm2/m32 | F3 0F 10 /r | Move Scalar Single-Precision Floating-Point Value
MOVUPS xmm2/m128, xmm1 | 0F 11 /r | Move Unaligned Packed Single-Precision Floating-Point Values
MOVSS xmm2/m32, xmm1 | F3 0F 11 /r | Move Scalar Single-Precision Floating-Point Value
MOVLPS xmm, m64 | 0F 12 /r | Move Low Packed Single-Precision Floating-Point Values
MOVHLPS xmm1, xmm2 | 0F 12 /r | Move Packed Single-Precision Floating-Point Values High to Low
MOVLPS m64, xmm | 0F 13 /r | Move Low Packed Single-Precision Floating-Point Values
UNPCKLPS xmm1, xmm2/m128 | 0F 14 /r | Unpack and Interleave Low Packed Single-Precision Floating-Point Values
UNPCKHPS xmm1, xmm2/m128 | 0F 15 /r | Unpack and Interleave High Packed Single-Precision Floating-Point Values
MOVHPS xmm, m64 | 0F 16 /r | Move High Packed Single-Precision Floating-Point Values
MOVLHPS xmm1, xmm2 | 0F 16 /r | Move Packed Single-Precision Floating-Point Values Low to High
MOVHPS m64, xmm | 0F 17 /r | Move High Packed Single-Precision Floating-Point Values
MOVAPS xmm1, xmm2/m128 | 0F 28 /r | Move Aligned Packed Single-Precision Floating-Point Values
MOVAPS xmm2/m128, xmm1 | 0F 29 /r | Move Aligned Packed Single-Precision Floating-Point Values
MOVNTPS m128, xmm1 | 0F 2B /r | Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint
MOVMSKPS reg, xmm | 0F 50 /r | Extract Packed Single-Precision Floating-Point 4-bit Sign Mask; the upper bits of the register are filled with zeros
CVTPI2PS xmm, mm/m64 | 0F 2A /r | Convert Packed Dword Integers to Packed Single-Precision FP Values
CVTSI2SS xmm, r/m32 | F3 0F 2A /r | Convert Dword Integer to Scalar Single-Precision FP Value
CVTSI2SS xmm, r/m64 | F3 REX.W 0F 2A /r | Convert Qword Integer to Scalar Single-Precision FP Value
CVTTPS2PI mm, xmm/m64 | 0F 2C /r | Convert with Truncation Packed Single-Precision FP Values to Packed Dword Integers
CVTTSS2SI r32, xmm/m32 | F3 0F 2C /r | Convert with Truncation Scalar Single-Precision FP Value to Dword Integer
CVTTSS2SI r64, xmm1/m32 | F3 REX.W 0F 2C /r | Convert with Truncation Scalar Single-Precision FP Value to Qword Integer
CVTPS2PI mm, xmm/m64 | 0F 2D /r | Convert Packed Single-Precision FP Values to Packed Dword Integers
CVTSS2SI r32, xmm/m32 | F3 0F 2D /r | Convert Scalar Single-Precision FP Value to Dword Integer
CVTSS2SI r64, xmm1/m32 | F3 REX.W 0F 2D /r | Convert Scalar Single-Precision FP Value to Qword Integer
UCOMISS xmm1, xmm2/m32 | 0F 2E /r | Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS
COMISS xmm1, xmm2/m32 | 0F 2F /r | Compare Scalar Ordered Single-Precision Floating-Point Values and Set EFLAGS
SQRTPS xmm1, xmm2/m128 | 0F 51 /r | Compute Square Roots of Packed Single-Precision Floating-Point Values
SQRTSS xmm1, xmm2/m32 | F3 0F 51 /r | Compute Square Root of Scalar Single-Precision Floating-Point Value
RSQRTPS xmm1, xmm2/m128 | 0F 52 /r | Compute Reciprocals of Square Roots of Packed Single-Precision Floating-Point Values
RSQRTSS xmm1, xmm2/m32 | F3 0F 52 /r | Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value
RCPPS xmm1, xmm2/m128 | 0F 53 /r | Compute Reciprocals of Packed Single-Precision Floating-Point Values
RCPSS xmm1, xmm2/m32 | F3 0F 53 /r | Compute Reciprocal of Scalar Single-Precision Floating-Point Value
ADDPS xmm1, xmm2/m128 | 0F 58 /r | Add Packed Single-Precision Floating-Point Values
ADDSS xmm1, xmm2/m32 | F3 0F 58 /r | Add Scalar Single-Precision Floating-Point Values
MULPS xmm1, xmm2/m128 | 0F 59 /r | Multiply Packed Single-Precision Floating-Point Values
MULSS xmm1, xmm2/m32 | F3 0F 59 /r | Multiply Scalar Single-Precision Floating-Point Values
SUBPS xmm1, xmm2/m128 | 0F 5C /r | Subtract Packed Single-Precision Floating-Point Values
SUBSS xmm1, xmm2/m32 | F3 0F 5C /r | Subtract Scalar Single-Precision Floating-Point Values
MINPS xmm1, xmm2/m128 | 0F 5D /r | Return Minimum Packed Single-Precision Floating-Point Values
MINSS xmm1, xmm2/m32 | F3 0F 5D /r | Return Minimum Scalar Single-Precision Floating-Point Values
DIVPS xmm1, xmm2/m128 | 0F 5E /r | Divide Packed Single-Precision Floating-Point Values
DIVSS xmm1, xmm2/m32 | F3 0F 5E /r | Divide Scalar Single-Precision Floating-Point Values
MAXPS xmm1, xmm2/m128 | 0F 5F /r | Return Maximum Packed Single-Precision Floating-Point Values
MAXSS xmm1, xmm2/m32 | F3 0F 5F /r | Return Maximum Scalar Single-Precision Floating-Point Values
LDMXCSR m32 | 0F AE /2 | Load MXCSR Register State
STMXCSR m32 | 0F AE /3 | Store MXCSR Register State
CMPPS xmm1, xmm2/m128, imm8 | 0F C2 /r ib | Compare Packed Single-Precision Floating-Point Values
CMPSS xmm1, xmm2/m32, imm8 | F3 0F C2 /r ib | Compare Scalar Single-Precision Floating-Point Values
SHUFPS xmm1, xmm2/m128, imm8 | 0F C6 /r ib | Shuffle Packed Single-Precision Floating-Point Values

* The single-precision floating-point bitwise operations ANDPS, ANDNPS, ORPS and XORPS produce the same results as the SSE2 integer (PAND, PANDN, POR, PXOR) and double-precision (ANDPD, ANDNPD, ORPD, XORPD) equivalents, but can introduce extra latency for domain changes when applied to values of the wrong type. [1]
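A common use of these bitwise instructions is sign-bit manipulation. The following minimal sketch (SSE intrinsics; the function names are illustrative) negates four floats with XORPS and takes their absolute values with ANDNPS, exploiting the fact that IEEE 754 keeps the sign in the top bit of each 32-bit lane:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* XORPS with a mask of sign bits negates all four lanes at once. */
static __m128 negate4(__m128 v) {
    const __m128 signmask = _mm_set1_ps(-0.0f);  /* 0x80000000 in each lane */
    return _mm_xor_ps(v, signmask);
}

/* ANDNPS computes (~mask) & v, clearing the sign bits: lane-wise fabs. */
static __m128 fabs4(__m128 v) {
    const __m128 signmask = _mm_set1_ps(-0.0f);
    return _mm_andnot_ps(signmask, v);
}
```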

SSE2 instructions

Added with Pentium 4

SSE2 SIMD floating-point instructions

SSE2 data movement instructions

Instruction | Opcode | Meaning
MOVAPD xmm1, xmm2/m128 | 66 0F 28 /r | Move Aligned Packed Double-Precision Floating-Point Values
MOVAPD xmm2/m128, xmm1 | 66 0F 29 /r | Move Aligned Packed Double-Precision Floating-Point Values
MOVNTPD m128, xmm1 | 66 0F 2B /r | Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint
MOVHPD xmm1, m64 | 66 0F 16 /r | Move High Packed Double-Precision Floating-Point Value
MOVHPD m64, xmm1 | 66 0F 17 /r | Move High Packed Double-Precision Floating-Point Value
MOVLPD xmm1, m64 | 66 0F 12 /r | Move Low Packed Double-Precision Floating-Point Value
MOVLPD m64, xmm1 | 66 0F 13 /r | Move Low Packed Double-Precision Floating-Point Value
MOVUPD xmm1, xmm2/m128 | 66 0F 10 /r | Move Unaligned Packed Double-Precision Floating-Point Values
MOVUPD xmm2/m128, xmm1 | 66 0F 11 /r | Move Unaligned Packed Double-Precision Floating-Point Values
MOVMSKPD reg, xmm | 66 0F 50 /r | Extract Packed Double-Precision Floating-Point Sign Mask
MOVSD* xmm1, xmm2/m64 | F2 0F 10 /r | Move or Merge Scalar Double-Precision Floating-Point Value
MOVSD xmm1/m64, xmm2 | F2 0F 11 /r | Move or Merge Scalar Double-Precision Floating-Point Value

SSE2 packed arithmetic instructions

Instruction | Opcode | Meaning
ADDPD xmm1, xmm2/m128 | 66 0F 58 /r | Add Packed Double-Precision Floating-Point Values
ADDSD xmm1, xmm2/m64 | F2 0F 58 /r | Add Low Double-Precision Floating-Point Value
DIVPD xmm1, xmm2/m128 | 66 0F 5E /r | Divide Packed Double-Precision Floating-Point Values
DIVSD xmm1, xmm2/m64 | F2 0F 5E /r | Divide Scalar Double-Precision Floating-Point Value
MAXPD xmm1, xmm2/m128 | 66 0F 5F /r | Maximum of Packed Double-Precision Floating-Point Values
MAXSD xmm1, xmm2/m64 | F2 0F 5F /r | Return Maximum Scalar Double-Precision Floating-Point Value
MINPD xmm1, xmm2/m128 | 66 0F 5D /r | Minimum of Packed Double-Precision Floating-Point Values
MINSD xmm1, xmm2/m64 | F2 0F 5D /r | Return Minimum Scalar Double-Precision Floating-Point Value
MULPD xmm1, xmm2/m128 | 66 0F 59 /r | Multiply Packed Double-Precision Floating-Point Values
MULSD xmm1, xmm2/m64 | F2 0F 59 /r | Multiply Scalar Double-Precision Floating-Point Value
SQRTPD xmm1, xmm2/m128 | 66 0F 51 /r | Square Root of Double-Precision Floating-Point Values
SQRTSD xmm1, xmm2/m64 | F2 0F 51 /r | Compute Square Root of Scalar Double-Precision Floating-Point Value
SUBPD xmm1, xmm2/m128 | 66 0F 5C /r | Subtract Packed Double-Precision Floating-Point Values
SUBSD xmm1, xmm2/m64 | F2 0F 5C /r | Subtract Scalar Double-Precision Floating-Point Value

SSE2 logical instructions

Instruction | Opcode | Meaning
ANDPD xmm1, xmm2/m128 | 66 0F 54 /r | Bitwise Logical AND of Packed Double-Precision Floating-Point Values
ANDNPD xmm1, xmm2/m128 | 66 0F 55 /r | Bitwise Logical AND NOT of Packed Double-Precision Floating-Point Values
ORPD xmm1, xmm2/m128 | 66 0F 56 /r | Bitwise Logical OR of Packed Double-Precision Floating-Point Values
XORPD xmm1, xmm2/m128 | 66 0F 57 /r | Bitwise Logical XOR of Packed Double-Precision Floating-Point Values

SSE2 compare instructions

Instruction | Opcode | Meaning
CMPPD xmm1, xmm2/m128, imm8 | 66 0F C2 /r ib | Compare Packed Double-Precision Floating-Point Values
CMPSD* xmm1, xmm2/m64, imm8 | F2 0F C2 /r ib | Compare Low Double-Precision Floating-Point Values
COMISD xmm1, xmm2/m64 | 66 0F 2F /r | Compare Scalar Ordered Double-Precision Floating-Point Values and Set EFLAGS
UCOMISD xmm1, xmm2/m64 | 66 0F 2E /r | Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS

SSE2 shuffle and unpack instructions

Instruction | Opcode | Meaning
SHUFPD xmm1, xmm2/m128, imm8 | 66 0F C6 /r ib | Packed Interleave Shuffle of Pairs of Double-Precision Floating-Point Values
UNPCKHPD xmm1, xmm2/m128 | 66 0F 15 /r | Unpack and Interleave High Packed Double-Precision Floating-Point Values
UNPCKLPD xmm1, xmm2/m128 | 66 0F 14 /r | Unpack and Interleave Low Packed Double-Precision Floating-Point Values

SSE2 conversion instructions

Instruction | Opcode | Meaning
CVTDQ2PD xmm1, xmm2/m64 | F3 0F E6 /r | Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point Values
CVTDQ2PS xmm1, xmm2/m128 | 0F 5B /r | Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point Values
CVTPD2DQ xmm1, xmm2/m128 | F2 0F E6 /r | Convert Packed Double-Precision Floating-Point Values to Packed Doubleword Integers
CVTPD2PI mm, xmm/m128 | 66 0F 2D /r | Convert Packed Double-Precision FP Values to Packed Dword Integers
CVTPD2PS xmm1, xmm2/m128 | 66 0F 5A /r | Convert Packed Double-Precision Floating-Point Values to Packed Single-Precision Floating-Point Values
CVTPI2PD xmm, mm/m64 | 66 0F 2A /r | Convert Packed Dword Integers to Packed Double-Precision FP Values
CVTPS2DQ xmm1, xmm2/m128 | 66 0F 5B /r | Convert Packed Single-Precision Floating-Point Values to Packed Signed Doubleword Integer Values
CVTPS2PD xmm1, xmm2/m64 | 0F 5A /r | Convert Packed Single-Precision Floating-Point Values to Packed Double-Precision Floating-Point Values
CVTSD2SI r32, xmm1/m64 | F2 0F 2D /r | Convert Scalar Double-Precision Floating-Point Value to Doubleword Integer
CVTSD2SI r64, xmm1/m64 | F2 REX.W 0F 2D /r | Convert Scalar Double-Precision Floating-Point Value to Quadword Integer With Sign Extension
CVTSD2SS xmm1, xmm2/m64 | F2 0F 5A /r | Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value
CVTSI2SD xmm1, r32/m32 | F2 0F 2A /r | Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value
CVTSI2SD xmm1, r/m64 | F2 REX.W 0F 2A /r | Convert Quadword Integer to Scalar Double-Precision Floating-Point Value
CVTSS2SD xmm1, xmm2/m32 | F3 0F 5A /r | Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value
CVTTPD2DQ xmm1, xmm2/m128 | 66 0F E6 /r | Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Doubleword Integers
CVTTPD2PI mm, xmm/m128 | 66 0F 2C /r | Convert with Truncation Packed Double-Precision FP Values to Packed Dword Integers
CVTTPS2DQ xmm1, xmm2/m128 | F3 0F 5B /r | Convert with Truncation Packed Single-Precision Floating-Point Values to Packed Signed Doubleword Integer Values
CVTTSD2SI r32, xmm1/m64 | F2 0F 2C /r | Convert with Truncation Scalar Double-Precision Floating-Point Value to Signed Dword Integer
CVTTSD2SI r64, xmm1/m64 | F2 REX.W 0F 2C /r | Convert with Truncation Scalar Double-Precision Floating-Point Value to Signed Qword Integer
  • CMPSD and MOVSD share their mnemonics with the string instructions CMPSD (CMPS) and MOVSD (MOVS); however, the former refer to scalar double-precision floating-point values, whereas the latter refer to doubleword strings. Assemblers disambiguate them based on the presence or absence of operands.

SSE2 SIMD integer instructions

SSE2 MMX-like instructions extended to SSE registers

SSE2 allows execution of MMX instructions on SSE registers, processing twice the amount of data at once.

Instruction | Opcode | Meaning
MOVD xmm, r/m32 | 66 0F 6E /r | Move doubleword
MOVD r/m32, xmm | 66 0F 7E /r | Move doubleword
MOVQ xmm1, xmm2/m64 | F3 0F 7E /r | Move quadword
MOVQ xmm2/m64, xmm1 | 66 0F D6 /r | Move quadword
MOVQ r/m64, xmm | 66 REX.W 0F 7E /r | Move quadword
MOVQ xmm, r/m64 | 66 REX.W 0F 6E /r | Move quadword
PMOVMSKB reg, xmm | 66 0F D7 /r | Move a byte mask, zeroing the upper bits of the register
PEXTRW reg, xmm, imm8 | 66 0F C5 /r ib | Extract the specified word into bits 15:0 of reg, zeroing the rest
PINSRW xmm, r32/m16, imm8 | 66 0F C4 /r ib | Insert the low word of the source at the specified word position
PACKSSDW xmm1, xmm2/m128 | 66 0F 6B /r | Convert 2 × 4 packed signed doubleword integers into 8 packed signed word integers with saturation
PACKSSWB xmm1, xmm2/m128 | 66 0F 63 /r | Convert 2 × 8 packed signed word integers into 16 packed signed byte integers with saturation
PACKUSWB xmm1, xmm2/m128 | 66 0F 67 /r | Convert 2 × 8 signed word integers into 16 unsigned byte integers with saturation
PADDB xmm1, xmm2/m128 | 66 0F FC /r | Add packed byte integers
PADDW xmm1, xmm2/m128 | 66 0F FD /r | Add packed word integers
PADDD xmm1, xmm2/m128 | 66 0F FE /r | Add packed doubleword integers
PADDQ xmm1, xmm2/m128 | 66 0F D4 /r | Add packed quadword integers
PADDSB xmm1, xmm2/m128 | 66 0F EC /r | Add packed signed byte integers with saturation
PADDSW xmm1, xmm2/m128 | 66 0F ED /r | Add packed signed word integers with saturation
PADDUSB xmm1, xmm2/m128 | 66 0F DC /r | Add packed unsigned byte integers with saturation
PADDUSW xmm1, xmm2/m128 | 66 0F DD /r | Add packed unsigned word integers with saturation
PAND xmm1, xmm2/m128 | 66 0F DB /r | Bitwise AND
PANDN xmm1, xmm2/m128 | 66 0F DF /r | Bitwise AND NOT
POR xmm1, xmm2/m128 | 66 0F EB /r | Bitwise OR
PXOR xmm1, xmm2/m128 | 66 0F EF /r | Bitwise XOR
PCMPEQB xmm1, xmm2/m128 | 66 0F 74 /r | Compare packed bytes for equality
PCMPEQW xmm1, xmm2/m128 | 66 0F 75 /r | Compare packed words for equality
PCMPEQD xmm1, xmm2/m128 | 66 0F 76 /r | Compare packed doublewords for equality
PCMPGTB xmm1, xmm2/m128 | 66 0F 64 /r | Compare packed signed byte integers for greater than
PCMPGTW xmm1, xmm2/m128 | 66 0F 65 /r | Compare packed signed word integers for greater than
PCMPGTD xmm1, xmm2/m128 | 66 0F 66 /r | Compare packed signed doubleword integers for greater than
PMULLW xmm1, xmm2/m128 | 66 0F D5 /r | Multiply packed signed word integers, store the low 16 bits of the results
PMULHW xmm1, xmm2/m128 | 66 0F E5 /r | Multiply packed signed word integers, store the high 16 bits of the results
PMULHUW xmm1, xmm2/m128 | 66 0F E4 /r | Multiply packed unsigned word integers, store the high 16 bits of the results
PMULUDQ xmm1, xmm2/m128 | 66 0F F4 /r | Multiply packed unsigned doubleword integers
PSLLW xmm1, xmm2/m128 | 66 0F F1 /r | Shift words left while shifting in 0s
PSLLW xmm1, imm8 | 66 0F 71 /6 ib | Shift words left while shifting in 0s
PSLLD xmm1, xmm2/m128 | 66 0F F2 /r | Shift doublewords left while shifting in 0s
PSLLD xmm1, imm8 | 66 0F 72 /6 ib | Shift doublewords left while shifting in 0s
PSLLQ xmm1, xmm2/m128 | 66 0F F3 /r | Shift quadwords left while shifting in 0s
PSLLQ xmm1, imm8 | 66 0F 73 /6 ib | Shift quadwords left while shifting in 0s
PSRAD xmm1, xmm2/m128 | 66 0F E2 /r | Shift doublewords right while shifting in sign bits
PSRAD xmm1, imm8 | 66 0F 72 /4 ib | Shift doublewords right while shifting in sign bits
PSRAW xmm1, xmm2/m128 | 66 0F E1 /r | Shift words right while shifting in sign bits
PSRAW xmm1, imm8 | 66 0F 71 /4 ib | Shift words right while shifting in sign bits
PSRLW xmm1, xmm2/m128 | 66 0F D1 /r | Shift words right while shifting in 0s
PSRLW xmm1, imm8 | 66 0F 71 /2 ib | Shift words right while shifting in 0s
PSRLD xmm1, xmm2/m128 | 66 0F D2 /r | Shift doublewords right while shifting in 0s
PSRLD xmm1, imm8 | 66 0F 72 /2 ib | Shift doublewords right while shifting in 0s
PSRLQ xmm1, xmm2/m128 | 66 0F D3 /r | Shift quadwords right while shifting in 0s
PSRLQ xmm1, imm8 | 66 0F 73 /2 ib | Shift quadwords right while shifting in 0s
PSUBB xmm1, xmm2/m128 | 66 0F F8 /r | Subtract packed byte integers
PSUBW xmm1, xmm2/m128 | 66 0F F9 /r | Subtract packed word integers
PSUBD xmm1, xmm2/m128 | 66 0F FA /r | Subtract packed doubleword integers
PSUBQ xmm1, xmm2/m128 | 66 0F FB /r | Subtract packed quadword integers
PSUBSB xmm1, xmm2/m128 | 66 0F E8 /r | Subtract packed signed byte integers with saturation
PSUBSW xmm1, xmm2/m128 | 66 0F E9 /r | Subtract packed signed word integers with saturation
PMADDWD xmm1, xmm2/m128 | 66 0F F5 /r | Multiply packed word integers, add adjacent doubleword results
PSUBUSB xmm1, xmm2/m128 | 66 0F D8 /r | Subtract packed unsigned byte integers with saturation
PSUBUSW xmm1, xmm2/m128 | 66 0F D9 /r | Subtract packed unsigned word integers with saturation
PUNPCKHBW xmm1, xmm2/m128 | 66 0F 68 /r | Unpack and interleave high-order bytes
PUNPCKHWD xmm1, xmm2/m128 | 66 0F 69 /r | Unpack and interleave high-order words
PUNPCKHDQ xmm1, xmm2/m128 | 66 0F 6A /r | Unpack and interleave high-order doublewords
PUNPCKLBW xmm1, xmm2/m128 | 66 0F 60 /r | Interleave low-order bytes
PUNPCKLWD xmm1, xmm2/m128 | 66 0F 61 /r | Interleave low-order words
PUNPCKLDQ xmm1, xmm2/m128 | 66 0F 62 /r | Interleave low-order doublewords
PAVGB xmm1, xmm2/m128 | 66 0F E0 /r | Average packed unsigned byte integers with rounding
PAVGW xmm1, xmm2/m128 | 66 0F E3 /r | Average packed unsigned word integers with rounding
PMINUB xmm1, xmm2/m128 | 66 0F DA /r | Compare packed unsigned byte integers and store packed minimum values
PMINSW xmm1, xmm2/m128 | 66 0F EA /r | Compare packed signed word integers and store packed minimum values
PMAXSW xmm1, xmm2/m128 | 66 0F EE /r | Compare packed signed word integers and store packed maximum values
PMAXUB xmm1, xmm2/m128 | 66 0F DE /r | Compare packed unsigned byte integers and store packed maximum values
PSADBW xmm1, xmm2/m128 | 66 0F F6 /r | Compute the absolute differences of the packed unsigned byte integers; the 8 low differences and 8 high differences are then summed separately to produce two unsigned word integer results
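As an example of the PSADBW row above, this sketch (SSE2 intrinsics; sad16 is an illustrative name) computes the sum of absolute differences of two 16-byte blocks, a building block of video motion estimation:

```c
#include <emmintrin.h>
#include <stdint.h>

uint32_t sad16(const uint8_t *p, const uint8_t *q) {
    __m128i a = _mm_loadu_si128((const __m128i *)p);
    __m128i b = _mm_loadu_si128((const __m128i *)q);
    /* PSADBW: two 64-bit lanes, each the sum of eight absolute differences */
    __m128i sad = _mm_sad_epu8(a, b);
    /* add the low and high partial sums */
    return (uint32_t)_mm_cvtsi128_si32(sad) +
           (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
}
```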
SSE2 integer instructions for SSE registers only

The following instructions can be used only on SSE registers, since by their nature they do not work on MMX registers.

Instruction | Opcode | Meaning
MASKMOVDQU xmm1, xmm2 | 66 0F F7 /r | Non-Temporal Store of Selected Bytes from an XMM Register into Memory
MOVDQ2Q mm, xmm | F2 0F D6 /r | Move low quadword from XMM to MMX register
MOVDQA xmm1, xmm2/m128 | 66 0F 6F /r | Move aligned double quadword
MOVDQA xmm2/m128, xmm1 | 66 0F 7F /r | Move aligned double quadword
MOVDQU xmm1, xmm2/m128 | F3 0F 6F /r | Move unaligned double quadword
MOVDQU xmm2/m128, xmm1 | F3 0F 7F /r | Move unaligned double quadword
MOVQ2DQ xmm, mm | F3 0F D6 /r | Move quadword from MMX register to low quadword of XMM register
MOVNTDQ m128, xmm1 | 66 0F E7 /r | Store Packed Integers Using Non-Temporal Hint
PSHUFHW xmm1, xmm2/m128, imm8 | F3 0F 70 /r ib | Shuffle packed high words
PSHUFLW xmm1, xmm2/m128, imm8 | F2 0F 70 /r ib | Shuffle packed low words
PSHUFD xmm1, xmm2/m128, imm8 | 66 0F 70 /r ib | Shuffle packed doublewords
PSLLDQ xmm1, imm8 | 66 0F 73 /7 ib | Shift double quadword left logically (shift count in bytes)
PSRLDQ xmm1, imm8 | 66 0F 73 /3 ib | Shift double quadword right logically (shift count in bytes)
PUNPCKHQDQ xmm1, xmm2/m128 | 66 0F 6D /r | Unpack and interleave high-order quadwords
PUNPCKLQDQ xmm1, xmm2/m128 | 66 0F 6C /r | Interleave low-order quadwords
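A brief sketch of PSHUFD from the table above (SSE2 intrinsics; the function names are illustrative): the 8-bit immediate holds four 2-bit selectors, one per destination doubleword.

```c
#include <emmintrin.h>

/* PSHUFD xmm, xmm, 0x1B: reverse the four doublewords */
__m128i reverse_dwords(__m128i v) {
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* PSHUFD xmm, xmm, 0x00: broadcast the lowest doubleword to all lanes */
__m128i broadcast_low_dword(__m128i v) {
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 0));
}
```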

SSE3 instructions

Added with Pentium 4 supporting SSE3

SSE3 SIMD floating-point instructions

Instruction | Opcode | Meaning | Notes
ADDSUBPS xmm1, xmm2/m128 | F2 0F D0 /r | Add/subtract single-precision floating-point values | For complex arithmetic
ADDSUBPD xmm1, xmm2/m128 | 66 0F D0 /r | Add/subtract double-precision floating-point values
MOVDDUP xmm1, xmm2/m64 | F2 0F 12 /r | Move double-precision floating-point value and duplicate
MOVSLDUP xmm1, xmm2/m128 | F3 0F 12 /r | Move and duplicate even-index single-precision floating-point values
MOVSHDUP xmm1, xmm2/m128 | F3 0F 16 /r | Move and duplicate odd-index single-precision floating-point values
HADDPS xmm1, xmm2/m128 | F2 0F 7C /r | Horizontal add packed single-precision floating-point values | For graphics
HADDPD xmm1, xmm2/m128 | 66 0F 7C /r | Horizontal add packed double-precision floating-point values
HSUBPS xmm1, xmm2/m128 | F2 0F 7D /r | Horizontal subtract packed single-precision floating-point values
HSUBPD xmm1, xmm2/m128 | 66 0F 7D /r | Horizontal subtract packed double-precision floating-point values
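The "complex arithmetic" note on ADDSUBPS can be made concrete with a standard idiom. The sketch below (SSE3 intrinsics; assumes two complex floats stored interleaved as re, im, re, im) multiplies two complex pairs using MOVSLDUP, MOVSHDUP and ADDSUBPS:

```c
#include <pmmintrin.h>  /* SSE3 intrinsics */

/* (ar+ai*i)*(br+bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i, two pairs at once */
__m128 complex_mul(__m128 a, __m128 b) {
    __m128 re  = _mm_moveldup_ps(a);           /* MOVSLDUP: ar, ar, ... */
    __m128 im  = _mm_movehdup_ps(a);           /* MOVSHDUP: ai, ai, ... */
    __m128 t1  = _mm_mul_ps(re, b);            /* ar*br, ar*bi, ... */
    __m128 bsw = _mm_shuffle_ps(b, b, 0xB1);   /* swap re/im within each pair */
    __m128 t2  = _mm_mul_ps(im, bsw);          /* ai*bi, ai*br, ... */
    return _mm_addsub_ps(t1, t2);              /* ADDSUBPS: subtract even, add odd lanes */
}
```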

SSE3 SIMD integer instructions

Instruction | Opcode | Meaning | Notes
LDDQU xmm1, mem | F2 0F F0 /r | Load unaligned data and return double quadword | Functionally equivalent to MOVDQU; intended for video encoding

SSSE3 instructions

Added with Xeon 5100 series and initial Core 2

The following MMX-like instructions, extended to SSE registers, were added with SSSE3:

Instruction | Opcode | Meaning
PSIGNB xmm1, xmm2/m128 | 66 0F 38 08 /r | Negate/zero/preserve packed byte integers depending on corresponding sign
PSIGNW xmm1, xmm2/m128 | 66 0F 38 09 /r | Negate/zero/preserve packed word integers depending on corresponding sign
PSIGND xmm1, xmm2/m128 | 66 0F 38 0A /r | Negate/zero/preserve packed doubleword integers depending on corresponding sign
PSHUFB xmm1, xmm2/m128 | 66 0F 38 00 /r | Shuffle bytes
PMULHRSW xmm1, xmm2/m128 | 66 0F 38 0B /r | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits
PMADDUBSW xmm1, xmm2/m128 | 66 0F 38 04 /r | Multiply signed and unsigned bytes, add horizontal pairs of signed words, pack saturated signed words
PHSUBW xmm1, xmm2/m128 | 66 0F 38 05 /r | Subtract and pack 16-bit signed integers horizontally
PHSUBSW xmm1, xmm2/m128 | 66 0F 38 07 /r | Subtract and pack 16-bit signed integers horizontally with saturation
PHSUBD xmm1, xmm2/m128 | 66 0F 38 06 /r | Subtract and pack 32-bit signed integers horizontally
PHADDSW xmm1, xmm2/m128 | 66 0F 38 03 /r | Add and pack 16-bit signed integers horizontally with saturation
PHADDW xmm1, xmm2/m128 | 66 0F 38 01 /r | Add and pack 16-bit integers horizontally
PHADDD xmm1, xmm2/m128 | 66 0F 38 02 /r | Add and pack 32-bit integers horizontally
PALIGNR xmm1, xmm2/m128, imm8 | 66 0F 3A 0F /r ib | Concatenate destination and source operands, extract a byte-aligned result shifted to the right
PABSB xmm1, xmm2/m128 | 66 0F 38 1C /r | Compute the absolute value of bytes and store the unsigned result
PABSW xmm1, xmm2/m128 | 66 0F 38 1D /r | Compute the absolute value of 16-bit integers and store the unsigned result
PABSD xmm1, xmm2/m128 | 66 0F 38 1E /r | Compute the absolute value of 32-bit integers and store the unsigned result
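PSHUFB above is effectively a parallel 16-entry table lookup. A minimal sketch (SSSE3 intrinsics; the control vector is an illustrative constant) reverses the sixteen bytes of a register:

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics */

__m128i reverse_bytes(__m128i v) {
    /* each control byte selects a source byte; a set top bit would zero it */
    const __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                        7,  6,  5,  4,  3,  2, 1, 0);
    return _mm_shuffle_epi8(v, ctrl);  /* PSHUFB */
}
```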

SSE4 instructions

SSE4.1

Added with Core 2 processors manufactured on the 45 nm process

SSE4.1 SIMD floating-point instructions
Instruction | Opcode | Meaning
DPPS xmm1, xmm2/m128, imm8 | 66 0F 3A 40 /r ib | Selectively multiply packed single-precision floating-point values, add and selectively store
DPPD xmm1, xmm2/m128, imm8 | 66 0F 3A 41 /r ib | Selectively multiply packed double-precision floating-point values, add and selectively store
BLENDPS xmm1, xmm2/m128, imm8 | 66 0F 3A 0C /r ib | Select packed single-precision floating-point values from specified mask
BLENDVPS xmm1, xmm2/m128, <XMM0> | 66 0F 38 14 /r | Select packed single-precision floating-point values from specified mask
BLENDPD xmm1, xmm2/m128, imm8 | 66 0F 3A 0D /r ib | Select packed double-precision floating-point values from specified mask
BLENDVPD xmm1, xmm2/m128, <XMM0> | 66 0F 38 15 /r | Select packed double-precision floating-point values from specified mask
ROUNDPS xmm1, xmm2/m128, imm8 | 66 0F 3A 08 /r ib | Round packed single-precision floating-point values
ROUNDSS xmm1, xmm2/m32, imm8 | 66 0F 3A 0A /r ib | Round the low single-precision floating-point value
ROUNDPD xmm1, xmm2/m128, imm8 | 66 0F 3A 09 /r ib | Round packed double-precision floating-point values
ROUNDSD xmm1, xmm2/m64, imm8 | 66 0F 3A 0B /r ib | Round the low double-precision floating-point value
INSERTPS xmm1, xmm2/m32, imm8 | 66 0F 3A 21 /r ib | Insert a selected single-precision floating-point value at the specified destination element and zero out destination elements
EXTRACTPS reg/m32, xmm1, imm8 | 66 0F 3A 17 /r ib | Extract one single-precision floating-point value at the specified offset and store the result (zero-extended, if applicable)

SSE4.1 SIMD integer instructions

Instruction | Opcode | Meaning
MPSADBW xmm1, xmm2/m128, imm8 | 66 0F 3A 42 /r ib | Sum absolute 8-bit integer differences of adjacent groups of 4 byte integers, with a selectable starting offset
PHMINPOSUW xmm1, xmm2/m128 | 66 0F 38 41 /r | Find the minimum unsigned word and its position
PMULLD xmm1, xmm2/m128 | 66 0F 38 40 /r | Multiply packed signed dword integers and store the low 32 bits
PMULDQ xmm1, xmm2/m128 | 66 0F 38 28 /r | Multiply packed signed doubleword integers and store quadword results
PBLENDVB xmm1, xmm2/m128, <XMM0> | 66 0F 38 10 /r | Select byte values from specified mask
PBLENDW xmm1, xmm2/m128, imm8 | 66 0F 3A 0E /r ib | Select words from specified mask
PMINSB xmm1, xmm2/m128 | 66 0F 38 38 /r | Compare packed signed byte integers and store packed minimum values
PMINUW xmm1, xmm2/m128 | 66 0F 38 3A /r | Compare packed unsigned word integers and store packed minimum values
PMINSD xmm1, xmm2/m128 | 66 0F 38 39 /r | Compare packed signed dword integers and store packed minimum values
PMINUD xmm1, xmm2/m128 | 66 0F 38 3B /r | Compare packed unsigned dword integers and store packed minimum values
PMAXSB xmm1, xmm2/m128 | 66 0F 38 3C /r | Compare packed signed byte integers and store packed maximum values
PMAXUW xmm1, xmm2/m128 | 66 0F 38 3E /r | Compare packed unsigned word integers and store packed maximum values
PMAXSD xmm1, xmm2/m128 | 66 0F 38 3D /r | Compare packed signed dword integers and store packed maximum values
PMAXUD xmm1, xmm2/m128 | 66 0F 38 3F /r | Compare packed unsigned dword integers and store packed maximum values
PINSRB xmm1, r32/m8, imm8 | 66 0F 3A 20 /r ib | Insert a byte integer value at the specified destination element
PINSRD xmm1, r/m32, imm8 | 66 0F 3A 22 /r ib | Insert a dword integer value at the specified destination element
PINSRQ xmm1, r/m64, imm8 | 66 REX.W 0F 3A 22 /r ib | Insert a qword integer value at the specified destination element
PEXTRB reg/m8, xmm2, imm8 | 66 0F 3A 14 /r ib | Extract a byte integer value at the specified source byte offset; upper bits are zeroed
PEXTRW reg/m16, xmm, imm8 | 66 0F 3A 15 /r ib | Extract a word and copy it to the lowest 16 bits, zero-extended
PEXTRD r/m32, xmm2, imm8 | 66 0F 3A 16 /r ib | Extract a dword integer value at the specified source dword offset
PEXTRQ r/m64, xmm2, imm8 | 66 REX.W 0F 3A 16 /r ib | Extract a qword integer value at the specified source qword offset
PMOVSXBW xmm1, xmm2/m64 | 66 0F 38 20 /r | Sign-extend 8 packed 8-bit integers to 8 packed 16-bit integers
PMOVZXBW xmm1, xmm2/m64 | 66 0F 38 30 /r | Zero-extend 8 packed 8-bit integers to 8 packed 16-bit integers
PMOVSXBD xmm1, xmm2/m32 | 66 0F 38 21 /r | Sign-extend 4 packed 8-bit integers to 4 packed 32-bit integers
PMOVZXBD xmm1, xmm2/m32 | 66 0F 38 31 /r | Zero-extend 4 packed 8-bit integers to 4 packed 32-bit integers
PMOVSXBQ xmm1, xmm2/m16 | 66 0F 38 22 /r | Sign-extend 2 packed 8-bit integers to 2 packed 64-bit integers
PMOVZXBQ xmm1, xmm2/m16 | 66 0F 38 32 /r | Zero-extend 2 packed 8-bit integers to 2 packed 64-bit integers
PMOVSXWD xmm1, xmm2/m64 | 66 0F 38 23 /r | Sign-extend 4 packed 16-bit integers to 4 packed 32-bit integers
PMOVZXWD xmm1, xmm2/m64 | 66 0F 38 33 /r | Zero-extend 4 packed 16-bit integers to 4 packed 32-bit integers
PMOVSXWQ xmm1, xmm2/m32 | 66 0F 38 24 /r | Sign-extend 2 packed 16-bit integers to 2 packed 64-bit integers
PMOVZXWQ xmm1, xmm2/m32 | 66 0F 38 34 /r | Zero-extend 2 packed 16-bit integers to 2 packed 64-bit integers
PMOVSXDQ xmm1, xmm2/m64 | 66 0F 38 25 /r | Sign-extend 2 packed 32-bit integers to 2 packed 64-bit integers
PMOVZXDQ xmm1, xmm2/m64 | 66 0F 38 35 /r | Zero-extend 2 packed 32-bit integers to 2 packed 64-bit integers
PTEST xmm1, xmm2/m128 | 66 0F 38 17 /r | Set ZF if AND result is all 0s, set CF if AND NOT result is all 0s
PCMPEQQ xmm1, xmm2/m128 | 66 0F 38 29 /r | Compare packed qwords for equality
PACKUSDW xmm1, xmm2/m128 | 66 0F 38 2B /r | Convert 2 × 4 packed signed doubleword integers into 8 packed unsigned word integers with saturation
MOVNTDQA xmm1, m128 | 66 0F 38 2A /r | Move double quadword using non-temporal hint if WC memory type
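A sketch of the PTEST row above (SSE4.1 intrinsics; all_zero is an illustrative name): because PTEST sets ZF directly from the AND of its operands, a whole-vector zero test needs no compare-and-extract sequence.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */

int all_zero(__m128i v) {
    return _mm_testz_si128(v, v);  /* PTEST: returns 1 iff (v AND v) == 0 */
}
```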

SSE4a

Added with Phenom processors

Instruction | Opcode | Meaning
EXTRQ | 66 0F 78 /0 ib ib (immediate form), 66 0F 79 /r (register form) | Extract Field From Register
INSERTQ | F2 0F 78 /r ib ib (immediate form), F2 0F 79 /r (register form) | Insert Field
MOVNTSD | F2 0F 2B /r | Move Non-Temporal Scalar Double-Precision Floating-Point
MOVNTSS | F3 0F 2B /r | Move Non-Temporal Scalar Single-Precision Floating-Point

SSE4.2

Added with Nehalem processors

Instruction | Opcode | Meaning
PCMPESTRI xmm1, xmm2/m128, imm8 | 66 0F 3A 61 /r ib | Packed comparison of string data with explicit lengths, generating an index
PCMPESTRM xmm1, xmm2/m128, imm8 | 66 0F 3A 60 /r ib | Packed comparison of string data with explicit lengths, generating a mask
PCMPISTRI xmm1, xmm2/m128, imm8 | 66 0F 3A 63 /r ib | Packed comparison of string data with implicit lengths, generating an index
PCMPISTRM xmm1, xmm2/m128, imm8 | 66 0F 3A 62 /r ib | Packed comparison of string data with implicit lengths, generating a mask
PCMPGTQ xmm1, xmm2/m128 | 66 0F 38 37 /r | Compare packed signed qwords for greater than
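A hedged sketch of how PCMPISTRI is typically used (SSE4.2 intrinsics; strlen_sse42 is an illustrative name): comparing a chunk against an all-zero operand with the "equal each" mode yields the index of the first NUL byte, or 16 if there is none. A production version must also take care not to read across unmapped page boundaries.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <stddef.h>

size_t strlen_sse42(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t len = 0;; len += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + len));
        /* index of the first NUL in chunk, or 16 if the chunk has no NUL */
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx < 16)
            return len + idx;
    }
}
```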

F16C

Half-precision floating-point conversion.

Instruction | Meaning
VCVTPH2PS xmm1, xmm2/m64 | Convert four half-precision floating-point values in memory or the bottom half of an XMM register to four single-precision floating-point values in an XMM register
VCVTPH2PS ymm1, xmm2/m128 | Convert eight half-precision floating-point values in memory or an XMM register (the bottom half of a YMM register) to eight single-precision floating-point values in a YMM register
VCVTPS2PH xmm1/m64, xmm2, imm8 | Convert four single-precision floating-point values in an XMM register to half-precision floating-point values in memory or the bottom half of an XMM register
VCVTPS2PH xmm1/m128, ymm2, imm8 | Convert eight single-precision floating-point values in a YMM register to half-precision floating-point values in memory or an XMM register
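A minimal sketch of the two conversions (F16C intrinsics, compile with e.g. -mf16c; the names are illustrative) that widens four half-precision values to single precision and narrows them back with round-to-nearest:

```c
#include <immintrin.h>
#include <stdint.h>

void f16_roundtrip(const uint16_t in[4], uint16_t out[4]) {
    __m128i h = _mm_loadl_epi64((const __m128i *)in);  /* four FP16 values */
    __m128  f = _mm_cvtph_ps(h);                       /* VCVTPH2PS */
    __m128i r = _mm_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT |
                                _MM_FROUND_NO_EXC);    /* VCVTPS2PH */
    _mm_storel_epi64((__m128i *)out, r);
}
```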

AVX

AVX was first supported by Intel with Sandy Bridge and by AMD with Bulldozer.

AVX introduces vector operations on 256-bit ymm registers.

Instruction | Description
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128 | Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of an XMM or YMM vector register.
VINSERTF128 | Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTF128 | Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VMASKMOVPS, VMASKMOVPD | Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged. On the AMD Jaguar processor architecture, this instruction with a memory source operand takes more than 300 clock cycles when the mask is zero, in which case the instruction should do nothing; this appears to be a design flaw. [2]
VPERMILPS, VPERMILPD | Permute in-lane. Shuffle the 32-bit or 64-bit vector elements of one input operand. These are in-lane 256-bit instructions, meaning that they operate on all 256 bits with two separate 128-bit shuffles, so they cannot shuffle across the 128-bit lanes. [3]
VPERM2F128 | Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VZEROALL | Set all YMM registers to zero and tag them as unused. Used when switching between 128-bit use and 256-bit use.
VZEROUPPER | Set the upper halves of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use.
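The VZEROUPPER row above matters in practice when 256-bit code calls into code that may use legacy (non-VEX) SSE. A hedged sketch (AVX intrinsics; scale8 is an illustrative name, and the loop only handles multiples of eight elements):

```c
#include <immintrin.h>

void scale8(float *p, float s, int n) {
    __m256 vs = _mm256_set1_ps(s);
    for (int i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), vs));
    /* VZEROUPPER: avoid SSE/AVX transition penalties in subsequent code;
       compilers normally insert this automatically at function exits */
    _mm256_zeroupper();
}
```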

AVX2

Introduced in Intel's Haswell microarchitecture and AMD's Excavator.

Expansion of most vector integer SSE and AVX instructions to 256 bits

Instruction | Description
VBROADCASTSS, VBROADCASTSD | Copy a 32-bit or 64-bit register operand to all elements of an XMM or YMM vector register. These are register-source versions of the same instructions in AVX1 (which accept only memory operands). There is no register-source form of VBROADCASTF128, but the same effect can be achieved with VINSERTF128.
VPBROADCASTB, VPBROADCASTW, VPBROADCASTD, VPBROADCASTQ | Copy an 8-, 16-, 32- or 64-bit integer register or memory operand to all elements of an XMM or YMM vector register.
VBROADCASTI128 | Copy a 128-bit memory operand to all elements of a YMM vector register.
VINSERTI128 | Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTI128 | Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VGATHERDPD, VGATHERQPD, VGATHERDPS, VGATHERQPS | Gather single- or double-precision floating-point values using either 32- or 64-bit indices and a scale factor.
VPGATHERDD, VPGATHERDQ, VPGATHERQD, VPGATHERQQ | Gather 32- or 64-bit integer values using either 32- or 64-bit indices and a scale factor.
VPMASKMOVD, VPMASKMOVQ | Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPERMPS, VPERMD | Shuffle the eight 32-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector.
VPERMPD, VPERMQ | Shuffle the four 64-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with an immediate constant as selector.
VPERM2I128 | Shuffle (two of) the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VPBLENDD | Doubleword immediate version of the PBLEND instructions from SSE4.
VPSLLVD, VPSLLVQ | Shift left logical. Allows variable shifts where each element is shifted according to the packed input.
VPSRLVD, VPSRLVQ | Shift right logical. Allows variable shifts where each element is shifted according to the packed input.
VPSRAVD | Shift right arithmetic. Allows variable shifts where each element is shifted according to the packed input.
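A sketch of the VGATHERDPS row above (AVX2 intrinsics; gather8 is an illustrative name): eight floats are loaded through eight 32-bit indices in a single instruction.

```c
#include <immintrin.h>

__m256 gather8(const float *base, __m256i idx) {
    /* scale of 4 bytes: the indices count float elements, not bytes */
    return _mm256_i32gather_ps(base, idx, 4);
}
```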

FMA3 and FMA4 instructions

Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused multiply-add instructions that take three input operands and write the result back to the first of them. FMA4 defines a set of 4-operand fused multiply-add instructions that take four operands: a destination operand and three source operands.

FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and was abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes, of the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects the operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused multiply-add operations to perform. (Values of x and y outside the given ranges result in something that is not an FMA3 instruction.)
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction: the digits in the '132', '213' and '231' suffixes name the instruction's three operands (operand 1 being both a source and the destination) in the order multiplicand, multiplier, addend. For example, VFMADD132PD computes op1 = (op1 × op3) + op2, VFMADD213PD computes op1 = (op2 × op1) + op3, and VFMADD231PD computes op1 = (op2 × op3) + op1.

For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
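The three orderings can be illustrated with a small sketch (FMA3 intrinsics, compile with e.g. -mfma; the comments spell out the semantics, while the compiler is free to pick whichever ordering suits register allocation):

```c
#include <immintrin.h>

__m128 fma_example(__m128 a, __m128 b, __m128 c) {
    /* VFMADD132PS a, c, b computes a = (a*b) + c
       VFMADD213PS a, b, c computes a = (b*a) + c
       VFMADD231PS a, b, c computes a = (b*c) + a
       All are fused: a*b+c with a single rounding step. */
    return _mm_fmadd_ps(a, b, c);
}
```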
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions – these all take the form EVEX.66.MAP6.W0 xy /r with the opcode byte working in the same way as for the FP32/FP64 variants. (For the FMA4 instructions, no FP16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, of the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select the floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused multiply-add operations to perform.

For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example, VFMADDPS xmm0, xmm1, xmm2/m128, xmm3 is encoded with VEX.W=0 (the r/m operand in the third position), while VFMADDPS xmm0, xmm1, xmm2, xmm3/m128 is encoded with VEX.W=1 (the r/m operand in the fourth position).

Opcode table
The 10 fused multiply-add operations and the 110 instruction variants they give rise to are given in the following table, with FMA4 opcode bytes and instructions marked with an asterisk (*) and FMA3 instructions left unmarked:

Opcode byte | FP32 instructions | FP64 instructions | FP16 instructions

Packed alternating multiply-add/subtract: (A*B)-C in even-numbered lanes [a], (A*B)+C in odd-numbered lanes
96 | VFMADDSUB132PS | VFMADDSUB132PD | VFMADDSUB132PH
A6 | VFMADDSUB213PS | VFMADDSUB213PD | VFMADDSUB213PH
B6 | VFMADDSUB231PS | VFMADDSUB231PD | VFMADDSUB231PH
5C*/5D* | VFMADDSUBPS* | VFMADDSUBPD* | –

Packed alternating multiply-subtract/add: (A*B)+C in even-numbered lanes, (A*B)-C in odd-numbered lanes
97 | VFMSUBADD132PS | VFMSUBADD132PD | VFMSUBADD132PH
A7 | VFMSUBADD213PS | VFMSUBADD213PD | VFMSUBADD213PH
B7 | VFMSUBADD231PS | VFMSUBADD231PD | VFMSUBADD231PH
5E*/5F* | VFMSUBADDPS* | VFMSUBADDPD* | –

Packed multiply-add: (A*B)+C
98 | VFMADD132PS | VFMADD132PD | VFMADD132PH
A8 | VFMADD213PS | VFMADD213PD | VFMADD213PH
B8 | VFMADD231PS | VFMADD231PD | VFMADD231PH
68*/69* | VFMADDPS* | VFMADDPD* | –

Scalar multiply-add: (A*B)+C
99 | VFMADD132SS | VFMADD132SD | VFMADD132SH
A9 | VFMADD213SS | VFMADD213SD | VFMADD213SH
B9 | VFMADD231SS | VFMADD231SD | VFMADD231SH
6A*/6B* | VFMADDSS* | VFMADDSD* | –

Packed multiply-subtract: (A*B)-C
9A | VFMSUB132PS | VFMSUB132PD | VFMSUB132PH
AA | VFMSUB213PS | VFMSUB213PD | VFMSUB213PH
BA | VFMSUB231PS | VFMSUB231PD | VFMSUB231PH
6C*/6D* | VFMSUBPS* | VFMSUBPD* | –

Scalar multiply-subtract: (A*B)-C
9B | VFMSUB132SS | VFMSUB132SD | VFMSUB132SH
AB | VFMSUB213SS | VFMSUB213SD | VFMSUB213SH
BB | VFMSUB231SS | VFMSUB231SD | VFMSUB231SH
6E*/6F* | VFMSUBSS* | VFMSUBSD* | –

Packed negative-multiply-add: (-A*B)+C
9C | VFNMADD132PS | VFNMADD132PD | VFNMADD132PH
AC | VFNMADD213PS | VFNMADD213PD | VFNMADD213PH
BC | VFNMADD231PS | VFNMADD231PD | VFNMADD231PH
78*/79* | VFNMADDPS* | VFNMADDPD* | –

Scalar negative-multiply-add: (-A*B)+C
9D | VFNMADD132SS | VFNMADD132SD | VFNMADD132SH
AD | VFNMADD213SS | VFNMADD213SD | VFNMADD213SH
BD | VFNMADD231SS | VFNMADD231SD | VFNMADD231SH
7A*/7B* | VFNMADDSS* | VFNMADDSD* | –

Packed negative-multiply-subtract: (-A*B)-C
9E | VFNMSUB132PS | VFNMSUB132PD | VFNMSUB132PH
AE | VFNMSUB213PS | VFNMSUB213PD | VFNMSUB213PH
BE | VFNMSUB231PS | VFNMSUB231PD | VFNMSUB231PH
7C*/7D* | VFNMSUBPS* | VFNMSUBPD* | –

Scalar negative-multiply-subtract: (-A*B)-C
9F | VFNMSUB132SS | VFNMSUB132SD | VFNMSUB132SH
AF | VFNMSUB213SS | VFNMSUB213SD | VFNMSUB213SH
BF | VFNMSUB231SS | VFNMSUB231SD | VFNMSUB231SH
7E*/7F* | VFNMSUBSS* | VFNMSUBSD* | –

  a. Vector register lanes are counted from 0 upwards in a little-endian manner – the lane that contains the first byte of the vector is lane 0 and thus even-numbered.

AVX-512

AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory. [4] Most of the added instructions may also be used with the 256- and 128-bit registers.
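A short sketch of the opmask mechanism (AVX-512F intrinsics; masked_add is an illustrative name): lanes whose mask bit is clear keep the old destination value.

```c
#include <immintrin.h>

__m512 masked_add(__m512 dst, __m512 a, __m512 b, __mmask16 k) {
    /* VADDPS zmm {k}, zmm, zmm: add only in the lanes selected by k,
       copying the corresponding lanes of dst elsewhere */
    return _mm512_mask_add_ps(dst, k, a, b);
}
```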

AMX

Intel AMX adds eight new tile-registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.

AMX-TILE: AMX control and tile management (added in Sapphire Rapids)

Instruction | Opcode | Description
LDTILECFG m512 | VEX.128.NP.0F38.W0 49 /0 | Load AMX tile configuration data structure from memory as a 64-byte data structure.
STTILECFG m512 | VEX.128.66.0F38.W0 49 /0 | Store AMX tile configuration data structure to memory.
TILERELEASE | VEX.128.NP.0F38.W0 49 C0 | Initialize TILECFG and tile data registers (tmm0 to tmm7) to the INIT state (all zeroes).
TILEZERO tmm | VEX.128.F2.0F38.W0 49 /r [a] | Zero out the contents of one tile register.
TILELOADD tmm, sibmem | VEX.128.F2.0F38.W0 4B /r [b] | Load a data tile from memory into an AMX tile register.
TILELOADDT1 tmm, sibmem | VEX.128.66.0F38.W0 4B /r [b] | Load a data tile from memory into an AMX tile register, with a hint that the data should not be kept in the nearest cache levels.
TILESTORED sibmem, tmm | VEX.128.F3.0F38.W0 4B /r [b] | Store a data tile to memory from an AMX tile register.

AMX-INT8: matrix multiplication of tiles, with source data interpreted as 8-bit integers and destination data accumulated as 32-bit integers (added in Sapphire Rapids)

Instruction | Opcode | Description
TDPBSSD tmm1, tmm2, tmm3 [c] | VEX.128.F2.0F38.W0 5E /r | Matrix multiply signed bytes from tmm2 with signed bytes from tmm3, accumulating the result in tmm1.
TDPBSUD tmm1, tmm2, tmm3 [c] | VEX.128.F3.0F38.W0 5E /r | Matrix multiply signed bytes from tmm2 with unsigned bytes from tmm3, accumulating the result in tmm1.
TDPBUSD tmm1, tmm2, tmm3 [c] | VEX.128.66.0F38.W0 5E /r | Matrix multiply unsigned bytes from tmm2 with signed bytes from tmm3, accumulating the result in tmm1.
TDPBUUD tmm1, tmm2, tmm3 [c] | VEX.128.NP.0F38.W0 5E /r | Matrix multiply unsigned bytes from tmm2 with unsigned bytes from tmm3, accumulating the result in tmm1.

AMX-BF16: matrix multiplication of tiles, with source data interpreted as bfloat16 values and destination data accumulated as FP32 floating-point values (added in Sapphire Rapids)

Instruction | Opcode | Description
TDPBF16PS tmm1, tmm2, tmm3 [c] | VEX.128.F3.0F38.W0 5C /r | Matrix multiply BF16 values from tmm2 with BF16 values from tmm3, accumulating the result in tmm1.

AMX-FP16: matrix multiplication of tiles, with source data interpreted as FP16 values and destination data accumulated as FP32 floating-point values (added in Granite Rapids)

Instruction | Opcode | Description
TDPFP16PS tmm1, tmm2, tmm3 [c] | VEX.128.F2.0F38.W0 5C /r | Matrix multiply FP16 values from tmm2 with FP16 values from tmm3, accumulating the result in tmm1.

AMX-COMPLEX: matrix multiplication of tiles, with source data interpreted as complex numbers represented as pairs of FP16 values and destination data accumulated as FP32 floating-point values (added in Granite Rapids D)

Instruction | Opcode | Description
TCMMRLFP16PS tmm1, tmm2, tmm3 [c] | VEX.128.NP.0F38.W0 6C /r | Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating the real part of the result in tmm1.
TCMMILFP16PS tmm1, tmm2, tmm3 [c] | VEX.128.66.0F38.W0 6C /r | Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating the imaginary part of the result in tmm1.

  a. For TILEZERO, the tile register to clear is specified by bits 5:3 of the instruction's ModR/M byte. Bits 7:6 must be set to 11b, and bits 2:0 must be set to 000b.
  b. For the TILELOADD, TILELOADDT1 and TILESTORED instructions, the memory argument must use an addressing mode with a SIB byte. Under this addressing mode, the base register and displacement specify the starting address of the first row of the tile in memory, while the scale and index specify a per-row stride. These instructions are all interruptible: an interrupt or memory exception taken in the middle of them will cause progress-tracking information to be written to TILECFG.start_row, so that the instruction may continue on a partially loaded/stored tile after the interruption.
  c. For all of the AMX matrix-multiply instructions, the three arguments are required to be three different tile registers, or else the instruction will #UD.
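A hedged sketch of the whole AMX flow from the tables above, using the GCC/Clang AMX intrinsics (compile with -mamx-tile -mamx-int8; on Linux the feature must additionally be requested from the kernel first). The tile-configuration byte offsets follow the TILECFG layout (palette at byte 0, per-tile bytes-per-row at byte 16, per-tile rows at byte 48); the matrix sizes and the pre-packed layout of b are illustrative assumptions.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* C (16x16 int32) += A (16x64 int8) x B (int8, assumed pre-packed in the
   4-byte-interleaved layout that TDPBSSD expects) */
void amx_matmul(int32_t c[16][16],
                const int8_t a[16][64], const int8_t b[16][64]) {
    uint8_t cfg[64];
    memset(cfg, 0, sizeof cfg);
    cfg[0] = 1;                   /* palette 1 */
    for (int t = 0; t < 3; t++) {
        cfg[16 + 2 * t] = 64;     /* tile t: 64 bytes per row */
        cfg[48 + t] = 16;         /* tile t: 16 rows */
    }
    _tile_loadconfig(cfg);        /* LDTILECFG */

    _tile_zero(0);                /* TILEZERO tmm0 (accumulator) */
    _tile_loadd(1, a, 64);        /* TILELOADD tmm1, row stride 64 bytes */
    _tile_loadd(2, b, 64);
    _tile_dpbssd(0, 1, 2);        /* TDPBSSD: int8 x int8 -> int32 accumulate */
    _tile_stored(0, c, 64);       /* TILESTORED */

    _tile_release();              /* TILERELEASE */
}
```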

References

  1. Intel, "Intel® 64 and IA-32 Architectures Optimization Reference Manual" (order no. 248966-044, June 2021), section 3.5.2.3.
  2. Agner Fog, "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers" (PDF). Retrieved October 17, 2016.
  3. "Chess programming AVX2". Archived from the original on July 10, 2017. Retrieved October 17, 2016.
  4. "Intel AVX-512 Instructions". Intel. Retrieved June 21, 2022.