X86 SIMD instruction listings

The x86 instruction set has been extended several times with SIMD (single instruction, multiple data) instruction set extensions. These extensions, starting with the MMX instruction set extension introduced with the Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.

Summary of SIMD extensions

The main SIMD instruction set extensions that have been introduced for x86 are:

  1. The count of 13 instructions for SSE3 includes the non-SIMD instructions MONITOR and MWAIT that were also introduced as part of "Prescott New Instructions" - these two instructions are considered to be SSE3 instructions by Intel but not by AMD.
  2. On older Zhaoxin processors, such as KX-6000 "LuJiaZui", AVX2 instructions are present but not exposed through CPUID due to the lack of FMA3 support. [1]
  3. Early drafts of the AVX10 specification also added an option for implementations to limit the maximum supported vector-register width to 128/256 bits [2] - however, as of March 2025, this option has been removed, making support for 512-bit vector-register width mandatory again. [3] [4]
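Because CPUID feature flags are the architectural way for software to discover which of these extensions may be used (and, as note 2 shows, what CPUID reports does not always match what the silicon implements), SIMD code paths are normally selected at run time. A minimal dispatch sketch using the GCC/Clang builtins (the use of these builtins is an assumption about the toolchain; other compilers need their own CPUID wrappers):

#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();                     /* populate the compiler's feature-flag cache */

    if (__builtin_cpu_supports("avx512f"))
        puts("dispatch: AVX-512 kernels");
    else if (__builtin_cpu_supports("avx2"))
        puts("dispatch: AVX2 kernels");
    else if (__builtin_cpu_supports("sse2"))
        puts("dispatch: SSE2 kernels");       /* architectural baseline on x86-64 */
    else
        puts("dispatch: scalar fallback");
    return 0;
}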

MMX instructions and extended variants thereof

Unless otherwise noted, these instructions are available in the forms indicated by the table columns below: MMX (no prefix), SSE2 (66h prefix), AVX (VEX.66 prefix) and AVX-512 (EVEX.66 prefix).

For many of the instruction mnemonics, (V) is used to indicate that the mnemonic exists in forms with and without a leading V: the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without it is used for the legacy MMX/SSE encodings without a VEX/EVEX prefix.
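In practice the choice between the legacy encoding and the V-prefixed (VEX/EVEX) encoding is usually made by the compiler rather than written out by hand. A minimal illustration with a C intrinsic; the described code generation is the usual GCC/Clang behavior and is an assumption about the toolchain:

#include <immintrin.h>

/* Adds four packed 32-bit integers.  Compiled for plain SSE2 this typically
 * assembles to the legacy encoding PADDD; compiled with AVX enabled (-mavx),
 * the same intrinsic assembles to the VEX-prefixed form VPADDD.            */
__m128i add_i32x4(__m128i a, __m128i b)
{
    return _mm_add_epi32(a, b);
}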

Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof

Description | Instruction mnemonics | Basic opcode | MMX (no prefix) | SSE2 (66h prefix) | AVX (VEX.66 prefix) | AVX-512 (EVEX.66 prefix): supported, subset, lane, bcst
Empty MMX technology state.

Mark all the FP/MMX registers as Empty, so that they can be freely used by later x87 code. [a]

EMMS (MMX)0F 77YesNoNo [b] No
Move scalar value from GPR (general-purpose register) or memory to vector register, with zero-fill32-bit(V)MOVD mm, r/m320F 6E /rYesYesYes (L=0,W=0)Yes (L=0,W=0)FNoNo
64-bit
(x86-64)
(V)MOVQ mm, r/m64,
MOVD mm, r/m64 [c]
Yes
(REX.W)
Yes
(REX.W) [d]
Yes (L=0,W=1)Yes (L=0,W=1)FNoNo
Move scalar value from vector register to GPR or memory32-bit(V)MOVD r/m32, mm0F 7E /rYesYesYes (L=0,W=0)Yes (L=0,W=0)FNoNo
64-bit
(x86-64)
(V)MOVQ r/m64, mm,
MOVD r/m64, mm [c]
Yes
(REX.W)
Yes
(REX.W) [d]
Yes (L=0,W=1)Yes (L=0,W=1)FNoNo
Vector move between vector register and either memory or another vector register.

For move to/from memory, the memory address is required to be aligned for (V)MOVDQA variants but not for MOVQ.

The 128-bit VEX-encoded forms of VMOVDQA with a memory argument will, if the memory is cacheable, perform their memory accesses atomically. [e]

MOVQ mm/m64, mm(MMX)
(V)MOVDQA xmm/m128,xmm
0F 7F /rMOVQMOVDQAVMOVDQA [f] VMOVDQA32​(W0)F32No
VMOVDQA64​(W1)F64No
MOVQ mm, mm/m64(MMX)
(V)MOVDQA xmm,xmm/m128
0F 6F /rMOVQMOVDQAVMOVDQA [f] VMOVDQA32​(W0)F32No
VMOVDQA64​(W1)F64No
Pack 32-bit signed integers to 16-bit, with saturation(V)PACKSSDW mm, mm/m64 [g] 0F 6B /rYesYesYesYes (W=0)BW1632
Pack 16-bit signed integers to 8-bit, with saturation(V)PACKSSWB mm, mm/m64 [g] 0F 63 /rYesYesYesYesBW8No
Pack 16-bit unsigned integers to 8-bit, with saturation(V)PACKUSWB mm, mm/m64 [g] 0F 67 /rYesYesYesYesBW8No
Unpack and interleave packed integers from the high halves of two input vectors8-bit(V)PUNPCKHBW mm, mm/m64 [g] 0F 68 /rYesYesYesYesBW8No
16-bit(V)PUNPCKHWD mm, mm/m64 [g] 0F 69 /rYesYesYesYesBW16No
32-bit(V)PUNPCKHDQ mm, mm/m64 [g] 0F 6A /rYesYesYesYes (W=0)F3232
Unpack and interleave packed integers from the low halves of two input vectors8-bit(V)PUNPCKLBW mm, mm/m32 [g] [h] 0F 60 /rYesYesYesYesBW8No
16-bit(V)PUNPCKLWD mm, mm/m32 [g] [h] 0F 61 /rYesYesYesYesBW16No
32-bit(V)PUNPCKLDQ mm, mm/m32 [g] [h] 0F 62 /rYesYesYesYes (W=0)F3232
Add packed integers8-bit(V)PADDB mm, mm/m640F FC /rYesYesYesYesBW8No
16-bit(V)PADDW mm, mm/m640F FD /rYesYesYesYesBW16No
32-bit(V)PADDD mm, mm/m640F FE /rYesYesYesYes (W=0)F3232
Add packed signed integers with saturation8-bit(V)PADDSB mm, mm/m640F EC /rYesYesYesYesBW8No
16-bit(V)PADDSW mm, mm/m640F ED /rYesYesYesYesBW16No
Add packed unsigned integers with saturation8-bit(V)PADDUSB mm, mm/m640F DC /rYesYesYesYesBW8No
16-bit(V)PADDUSW mm, mm/m640F DD /rYesYesYesYesBW16No
Subtract packed integers8-bit(V)PSUBB mm, mm/m640F F8 /rYesYesYesYesBW8No
16-bit(V)PSUBW mm, mm/m640F F9 /rYesYesYesYesBW16No
32-bit(V)PSUBD mm, mm/m640F FA /rYesYesYesYes (W=0)F3232
Subtract packed signed integers with saturation8-bit(V)PSUBSB mm, mm/m640F E8 /rYesYesYesYesBW8No
16-bit(V)PSUBSW mm, mm/m640F E9 /rYesYesYesYesBW16No
Subtract packed unsigned integers with saturation8-bit(V)PSUBUSB mm, mm/m640F D8 /rYesYesYesYesBW8No
16-bit(V)PSUBUSW mm, mm/m640F D9 /rYesYesYesYesBW16No
Compare packed integers for equality8-bit(V)PCMPEQB mm, mm/m640F 74 /rYesYesYesYes [i] BW8No
16-bit(V)PCMPEQW mm, mm/m640F 75 /rYesYesYesYes [i] BW16No
32-bit(V)PCMPEQD mm, mm/m640F 76 /rYesYesYesYes (W=0) [i] F3232
Compare packed integers for signed greater-than8-bit(V)PCMPGTB mm, mm/m640F 64 /rYesYesYesYes [i] BW8No
16-bit(V)PCMPGTW mm, mm/m640F 65 /rYesYesYesYes [i] BW16No
32-bit(V)PCMPGTD mm, mm/m640F 66 /rYesYesYesYes (W=0) [i] F3232
Multiply packed 16-bit signed integers, add results pairwise into 32-bit integers(V)PMADDWD mm, mm/m640F F5 /rYesYesYesYes [j] BW32No
Multiply packed 16-bit signed integers, store high 16 bits of results(V)PMULHW mm, mm/m640F E5 /rYesYesYesYesBW16No
Multiply packed 16-bit integers, store low 16 bits of results(V)PMULLW mm, mm/m640F D5 /rYesYesYesYesBW16No
Vector bitwise AND (V)PAND mm, mm/m640F DB /rYesYesYesVPANDD​(W0)F3232
VPANDQ​(W1)F6464
Vector bitwise AND-NOT(V)PANDN mm, mm/m640F DF /rYesYesYesVPANDND​(W0)F3232
VPANDNQ​(W1)F6464
Vector bitwise OR (V)POR mm, mm/m640F EB /rYesYesYesVPORD(W0)F3232
VPORQ(W1)F6464
Vector bitwise XOR (V)PXOR mm, mm/m640F EE /rYesYesYesVPXORD(W0)F3232
VPXORQ(W1)F6464
Left-shift of packed integers, with common shift-amount16-bit(V)PSLLW mm, imm80F 71 /6 ibYesYesYesYesBW16No
(V)PSLLW mm, mm/m64 [k] 0F F1 /rYesYesYesYesBW16No
32-bit(V)PSLLD mm, imm80F 72 /6 ibYesYesYesYes (W=0)F3232
(V)PSLLD mm, mm/m64 [k] 0F F2 /rYesYesYesYes (W=0)F32No
64-bit(V)PSLLQ mm, imm80F 73 /6 ibYesYesYesYes (W=1)F6464
(V)PSLLQ mm, mm/m64 [k] 0F F3 /rYesYesYesYes (W=1)F64No
Right-shift of packed signed integers, with common shift-amount16-bit(V)PSRAW mm, imm80F 71 /4 ibYesYesYesYesBW16No
(V)PSRAW mm, mm/m64 [k] 0F E1 /rYesYesYesYesBW16No
32-bit(V)PSRAD mm, imm80F 72 /4 ibYesYesYesYes (W=0)F3232
(V)PSRAD mm, mm/m64 [k] 0F E2 /rYesYesYesYes (W=0)F32No
Right-shift of packed unsigned integers, with common shift-amount16-bit(V)PSRLW mm, imm80F 71 /2 ibYesYesYesYesBW16No
(V)PSRLW mm, mm/m64 [k] 0F D1 /rYesYesYesYesBW16No
32-bit(V)PSRLD mm, imm80F 72 /2 ibYesYesYesYes (W=0)F3232
(V)PSRLD mm, mm/m64 [k] 0F D2 /rYesYesYesYes (W=0)F32No
64-bit(V)PSRLQ mm, imm80F 73 /2 ibYesYesYesYes (W=1)F6464
(V)PSRLQ mm, mm/m64 [k] 0F D3 /rYesYesYesYes (W=1)F64No
  1. EMMS will also set the x87 top-of-stack to 0.
    Unlike the older FNINIT instruction, EMMS will not update the FPU Control Word, nor will it update any part of the FPU Status Register other than the top-of-stack. If there are any unmasked pending x87 exceptions, EMMS will raise the exception while FNINIT will clear it.
  2. The 0F 77 opcode can be VEX-encoded (resulting in the AVX VZEROUPPER and VZEROALL instructions), but this requires a VEX.NP prefix, not a VEX.66 prefix.
  3. 1 2 The 64-bit move instruction forms that are encoded by using a REX.W prefix with the 0F 6E and 0F 7E opcodes are listed with different mnemonics in Intel and AMD documentation — MOVQ in Intel documentation [5] and MOVD in AMD documentation. [6]
    This is a documentation difference only — the operation performed by these opcodes is the same for Intel and AMD.
    This documentation difference applies only to the MMX/SSE forms of these opcodes — for VEX/EVEX-encoded forms, both Intel and AMD use the mnemonic VMOVQ.
  4. 1 2 The REX.W-encoded variants of MOVQ are available in 64-bit "long mode" only. For SSE2 and later, MOVQ to and from xmm/ymm/zmm registers can also be encoded with F3 0F 7E /r and 66 0F D6 /r respectively - these encodings are shorter and available outside 64-bit mode.
  5. On all Intel, [7] AMD [8] and Zhaoxin [9] processors that support AVX, the 128-bit forms of VMOVDQA (encoded with a VEX prefix and VEX.L=0) are, when used with a memory argument addressing WB (write-back cacheable) memory, architecturally guaranteed to perform the 128-bit memory access atomically - this applies to both load and store.

    (Intel and AMD provide somewhat wider guarantees covering more 128-bit instruction variants, but Zhaoxin provides the guarantee for cacheable VMOVDQA only.)

    While 128-bit VMOVDQA is atomic, it is not locked — it can be reordered in the same way as normal x86 loads/stores (e.g. loads passing older stores).

    On processors that support SSE but don't support AVX, the 128-bit forms of SSE load/store instructions such as MOVAPS/MOVAPD/MOVDQA are not guaranteed to execute atomically — examples of processors where such instructions have been observed to execute non-atomically include Intel Core Duo and AMD K10. [10]

  6. 1 2 VMOVDQA is available with a vector length of 256 bits under AVX, not requiring AVX2.

    Unlike the 128-bit form, the 256-bit form of VMOVDQA does not provide any special atomicity guarantees.

  7. 1 2 3 4 5 6 7 8 9 For the VPACK* and VPUNPCK* instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  8. 1 2 3 For the memory argument forms of (V)PUNPCKL* instructions, the memory argument is half-width only for the MMX variants of the instructions. For SSE/AVX/AVX-512 variants, the width of the memory argument is the full vector width even though only half of it is actually used.
  9. 1 2 3 4 5 6 The EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
  10. The (V)PMADDWD instruction will add multiplication results pairwise, but will not add the sum to an accumulator. AVX512_VNNI provides the instructions VPDPWSSD and VPDPWSSDS, which add multiplication results pairwise and then also add them to a per-32-bit-lane accumulator.
  11. 1 2 3 4 5 6 7 8 For the MMX packed shift instructions PSLL* and PSR* with a shift-argument taken from a vector source (mm or m64), the shift-amount is considered to be a single 64-bit scalar value - the same shift-amount is used for all lanes of the destination vector. This shift-amount is unsigned and is not masked - all bits are considered (e.g. a shift-amount of 0x80000000_00000000 can be specified and will have the same effect as a shift-amount of 64).

    For all SSE2/AVX/AVX512 extended variants of these instructions, the shift-amount vector argument is considered to be a 128-bit (xmm or m128) argument - the bottom 64 bits are used as the shift-amount.

    Packed shift-instructions that can take a variable per-lane shift-amount were introduced in AVX2 for 32/64-bit lanes and AVX512BW for 16-bit lanes (VPSLLV*, VPSRLV*, VPSRAV* instructions).
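Note 11 above describes the difference between the classic shifts, which take one scalar shift-amount for the whole vector, and the later per-lane variable shifts. A minimal sketch of the 128-bit forms using C intrinsics (function names are illustrative; the per-lane variant requires AVX2):

#include <immintrin.h>

/* (V)PSLLD with a vector-supplied count: one scalar shift-amount for all
 * lanes, not masked, so counts of 32 or more simply zero every lane.      */
__m128i shift_all_lanes_left(__m128i v, int count)
{
    return _mm_sll_epi32(v, _mm_cvtsi32_si128(count));
}

/* AVX2 VPSLLVD: an independent shift-amount per 32-bit lane.              */
__m128i shift_per_lane_left(__m128i v, __m128i counts)
{
    return _mm_sllv_epi32(v, counts);
}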

MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof

Description | Instruction mnemonics | Basic opcode | MMX (no prefix) | SSE2 (66h prefix) | AVX (VEX.66 prefix) | AVX-512 (EVEX.66 prefix): supported, subset, lane, bcst
Added with SSE and MMX+
Perform shuffle of four 16-bit integers in 64-bit vector (MMX) [a] PSHUFW mm,mm/m64,imm8(MMX)0F 70 /r ibPSHUFWPSHUFDVPSHUFDVPSHUFD
(W=0)
F3232
Perform shuffle of four 32-bit integers in 128-bit vector (SSE2)(V)PSHUFD xmm,xmm/m128,imm8 [b]
Insert integer into 16-bit vector register lane(V)PINSRW mm,r32/m16,imm80F C4 /r ibYesYesYes (L=0,W=0 [c] )Yes (L=0)BWNoNo
Extract integer from 16-bit vector register lane, with zero-extension(V)PEXTRW r32,mm,imm8 [d] 0F C5 /r ibYesYesYes (L=0,W=0 [c] )Yes (L=0)BWNoNo
Create a bitmask made from the top bit of each byte in the source vector, and store to integer register(V)PMOVMSKB r32,mm0F D7 /rYesYesYesNo [e]
Minimum-value of packed unsigned 8-bit integers(V)PMINUB mm,mm/m640F DA /rYesYesYesYesBW8No
Maximum-value of packed unsigned 8-bit integers(V)PMAXUB mm,mm/m640F DE /rYesYesYesYesBW8No
Minimum-value of packed signed 16-bit integers(V)PMINSW mm,mm/m640F EA /rYesYesYesYesBW16No
Maximum-value of packed signed 16-bit integers(V)PMAXSW mm,mm/m640F EE /rYesYesYesYesBW16No
Rounded average of packed unsigned integers. The per-lane operation is:
dst ← (src1 + src2 + 1)>>1
8-bit(V)PAVGB mm,mm/m640F E0 /rYesYesYesYesBW8No
16-bit(V)PAVGW mm,mm/m640F E3 /rYesYesYesYesBW16No
Multiply packed 16-bit unsigned integers, store high 16 bits of results(V)PMULHUW mm,mm/m640F E4 /rYesYesYesYesBW16No
Store vector register to memory using Non-Temporal Hint.

Memory operand required to be aligned for all (V)MOVNTDQ variants, but not for MOVNTQ.

MOVNTQ m64,mm(MMX)
(V)MOVNTDQ m128,xmm
0F E7 /rMOVNTQMOVNTDQVMOVNTDQ [f] VMOVNTDQ
(W=0)
FNoNo
Compute sum of absolute differences for eight 8-bit unsigned integers, storing the result as a 64-bit integer.

For vector widths wider than 64 bits (SSE/AVX/AVX-512), this calculation is done separately for each 64-bit lane of the vectors, producing a vector of 64-bit integers.

(V)PSADBW mm,mm/m640F F6 /rYesYesYesYesBWNoNo
Unaligned store vector register to memory using byte write-mask, with Non-Temporal Hint.

First argument provides data to store, second argument provides byte write-mask (top bit of each byte). [g] Address to store to is given by DS:DI/EDI/RDI (DS: segment overridable with segment-prefix).

MASKMOVQ mm,mm(MMX)
(V)MASKMOVDQU xmm,xmm
0F F7 /rMASKMOVQMASKMOVDQUVMASKMOVDQU
(L=0) [h]
No [i]
Added with SSE2
Multiply packed 32-bit unsigned integers, store full 64-bit result.

The input integers are taken from the low 32 bits of each 64-bit vector lane.

(V)PMULUDQ mm,mm/m640F F4 /rYesYesYesYes (W=1)F6464
Add packed 64-bit integers(V)PADDQ mm, mm/m640F D4 /rYesYesYesYes (W=1)F6464
Subtract packed 64-bit integers(V)PSUBQ mm,mm/m640F FB /rYesYesYesYes (W=1)F6464
Added with SSSE3
Vector Byte Shuffle(V)PSHUFB mm,mm/m64 [b] 0F38 00 /rYesYes [j] YesYesBW8No
Pairwise horizontal add of packed integers16-bit(V)PHADDW mm,mm/m64 [b] 0F38 01 /rYesYesYesNo
32-bit(V)PHADDD mm,mm/m64 [b] 0F38 02 /rYesYesYesNo
Pairwise horizontal add of packed 16-bit signed integers, with saturation(V)PHADDSW mm,mm/m64 [b] 0F38 03 /rYesYesYesNo
Multiply packed 8-bit signed and unsigned integers, add results pairwise into 16-bit signed integers with saturation. First operand is treated as unsigned, second operand as signed.(V)PMADDUBSW mm,mm/m640F38 04 /rYesYesYesYesBW16No
Pairwise horizontal subtract of packed integers.

The higher-order integer of each pair is subtracted from the lower-order integer.

16-bit(V)PHSUBW mm,mm/m64 [b] 0F38 05 /rYesYesYesNo
32-bit(V)PHSUBD mm,mm/m64 [b] 0F38 06 /rYesYesYesNo
Pairwise horizontal subtract of packed 16-bit signed integers, with saturation(V)PHSUBSW mm,mm/m64 [b] 0F38 07 /rYesYesYesNo
Modify packed integers in first source argument based on the sign of packed signed integers in second source argument. The per-lane operation performed is:
if( src2 < 0 ) dst ← -src1 else if( src2 == 0 ) dst ← 0 else dst ← src1
8-bit(V)PSIGNB mm,mm/m640F38 08 /rYesYesYesNo
16-bit(V)PSIGNW mm,mm/m640F38 09 /rYesYesYesNo
32-bit(V)PSIGND mm,mm/m640F38 0A /rYesYesYesNo
Multiply packed 16-bit signed integers, then perform rounding and scaling to produce a 16-bit signed integer result.

The calculation performed per 16-bit lane is:
dst ← (src1*src2 + (1<<14)) >> 15

(V)PMULHRSW mm,mm/m640F38 0B /rYesYesYesYesBW16No
Absolute value of packed signed integers8-bit(V)PABSB mm,mm/m640F38 1C /rYesYesYesYesBW8No
16-bit(V)PABSW mm,mm/m640F38 1D /rYesYesYesYesBW16No
32-bit(V)PABSD mm,mm/m640F38 1E /rPABSDPABSDVPABSDVPABSD(W0)F3232
64-bitVPABSQ xmm,xmm/m128(AVX-512)VPABSQ(W1)F6464
Packed Align Right.

Concatenate two input vectors into a double-size vector, then right-shift by the number of bytes specified by the imm8 argument. The shift-amount is not masked - if the shift-amount is greater than the input vector size, zeroes will be shifted in.

(V)PALIGNR mm,mm/m64,imm8 [b] 0F3A 0F /r ibYesYesYesYes [k] BW8No
  1. For shuffle of four 16-bit integers in a 64-bit section of a 128-bit XMM register, the SSE2 instructions PSHUFLW (opcode F2 0F 70 /r) or PSHUFHW (opcode F3 0F 70 /r) may be used.
  2. 1 2 3 4 5 6 7 8 9 For the VPSHUFD, VPSHUFB, VPHADD*, VPHSUB* and VPALIGNR instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  3. 1 2 For the VEX-encoded forms of the VPINSRW and VPEXTRW instruction, the Intel SDM (as of rev 084) indicates that the instructions must be encoded with VEX.W=0, however neither Intel XED nor AMD APM indicate any such requirement.
  4. The 0F C5 /r ib variant of PEXTRW allows register destination only. For SSE4.1 and later, a variant that allows a memory destination is available with the opcode 66 0F 3A 15 /r ib.
  5. EVEX-prefixed opcode not available. Under AVX-512, a bitmask made from the top bit of each byte can instead be constructed with the VPMOVB2M instruction, with opcode EVEX.F3.0F38.W0 29 /r, which will store such a bitmask to an opmask register.
  6. VMOVNTDQ is available with a vector length of 256 bits under AVX, not requiring AVX2.
  7. For the MASKMOVQ and (V)MASKMOVDQU instructions, exception and trap behavior for disabled lanes is implementation-dependent. For example, a given implementation may signal a data breakpoint or a page fault for bytes that are zero-masked and not actually written.
  8. For AVX, masked stores to memory are also available using the VMASKMOVPS instruction with opcode VEX.66.0F38 2E /r - unlike VMASKMOVDQU, this instruction allows 256-bit stores without temporal hints, although its mask is coarser - 4 bytes vs 1 byte per lane.
  9. Opcode not available under AVX-512. Under AVX-512, unaligned masked stores to memory (albeit without temporal hints) can be done with the VMOVDQU(8|16|32|64) instructions with opcode EVEX.F2/F3.0F 7F /r, using an opmask register to provide a write mask.
  10. For AVX2 and AVX-512 with vectors wider than 128 bits, the VPSHUFB instruction is restricted to byte-shuffle within each 128-bit lane. Instructions that can do shuffles across 128-bit lanes include e.g. AVX2's VPERMD (shuffle of 32-bit lanes across 256-bit YMM register) and AVX512_VBMI's VPERMB (full byte shuffle across 64-byte ZMM register).
  11. For AVX-512, VPALIGNR is supported but will perform its operation within each 128-bit lane. For packed alignment shifts that can shift data across 128-bit lanes, AVX512F's VALIGND instruction may be used, although its shift-amount is specified in units of 32-bits rather than bytes.
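The byte shuffle described in the table above ((V)PSHUFB) selects each result byte from the source vector using the low four bits of the corresponding mask byte, or writes zero if the mask byte's top bit is set. A minimal 128-bit example in C intrinsics (function name is illustrative); per note 10, a 256-bit VPSHUFB would apply the same permutation separately within each 128-bit half:

#include <immintrin.h>

/* Reverse the 16 bytes of a vector with SSSE3 PSHUFB. */
__m128i reverse_bytes(__m128i v)
{
    const __m128i rev = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7,  6,  5,  4,  3,  2, 1, 0);
    return _mm_shuffle_epi8(v, rev);
}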

SSE instructions and extended variants thereof

Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof

For the instructions in the table below, the following considerations apply unless otherwise noted:

From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. These forms are functionally identical; however, some processors with SSE2 implement the integer, FP32 and FP64 execution units as three separate execution clusters, where forwarding results from one cluster to another may incur a performance penalty. Such penalties can be minimized by choosing the instruction form that matches the data being operated on. (For example, SSE2 provides three forms of vector bitwise XOR - PXOR, XORPS and XORPD - intended for use on integer, FP32 and FP64 data, respectively.)
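In intrinsics code the three domains are reflected in distinct vector types and intrinsics, so the matching form can be chosen explicitly. A minimal illustration (function names are illustrative):

#include <immintrin.h>

/* Three functionally identical bitwise XORs in the three SSE2 domains. */
__m128i xor_int(__m128i a, __m128i b) { return _mm_xor_si128(a, b); }  /* PXOR  */
__m128  xor_f32(__m128  a, __m128  b) { return _mm_xor_ps(a, b);    }  /* XORPS */
__m128d xor_f64(__m128d a, __m128d b) { return _mm_xor_pd(a, b);    }  /* XORPD */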

Instruction description | Basic opcode | Single Precision (FP32): Packed (no prefix) and Scalar (F3h prefix) | Double Precision (FP64): Packed (66h prefix) and Scalar (F2h prefix) | AVX-512: RC/SAE
Each Packed/Scalar column group has three sub-columns: SSE/SSE2 instruction, AVX (VEX) (footnote [a] applies to the scalar forms), AVX-512 (EVEX).
Unaligned load from memory or vector register0F 10 /rMOVUPS x,x/m128YesYes [b] MOVSS x,x/m32YesYesMOVUPD x,x/m128YesYes [b] MOVSD x,x/m64 [c] YesYesNo
Unaligned store to memory or vector register0F 11 /rMOVUPS x/m128,xYesYes [b] MOVSS x/m32,xYesYesMOVUPD x/m128,xYesYes [b] MOVSD x/m64,x [c] YesYesNo
Load 64 bits from memory or upper half of XMM register into the lower half of XMM register while keeping the upper half unchanged0F 12 /rMOVHLPS x,x(L0) [d] (L0) [d] (MOVSLDUP) [e] MOVLPD x,m64(L0) [d] (L0) [d] (MOVDDUP) [e] No
MOVLPS x,m64(L0) [d] (L0) [d]
Store 64 bits to memory from lower half of XMM register0F 13 /rMOVLPS m64,x(L0) [d] (L0) [d] NoNoNoMOVLPD m64,x(L0) [d] (L0) [d] NoNoNoNo
Unpack and interleave low-order floating-point values0F 14 /rUNPCKLPS x,x/m128Yes [f] Yes [f] NoNoNoUNPCKLPD x,x/m128Yes [f] Yes [f] NoNoNoNo
Unpack and interleave high-order floating-point values0F 15 /rUNPCKHPS x,x/m128Yes [f] Yes [f] NoNoNoUNPCKHPD x,x/m128Yes [f] Yes [f] NoNoNoNo
Load 64 bits from memory or lower half of XMM register into the upper half of XMM register while keeping the lower half unchanged0F 16 /rMOVLHPS x,x(L0) [d] (L0) [d] (MOVSHDUP) [e] MOVHPD x,m64(L0) [d] (L0) [d] NoNoNoNo
MOVHPS x,m64(L0) [d] (L0) [d]
Store 64 bits to memory from upper half of XMM register0F 17 /rMOVHPS m64,x(L0) [d] (L0) [d] NoNoNoMOVHPD m64,x(L0) [d] (L0) [d] NoNoNoNo
Aligned load from memory or vector register0F 28 /rMOVAPS x,x/m128YesYes [b] NoNoNoMOVAPD x,x/m128YesYes [b] NoNoNoNo
Aligned store to memory or vector register0F 29 /rMOVAPS x/m128,xYesYes [b] NoNoNoMOVAPD x/m128,xYesYes [b] NoNoNoNo
Integer to floating-point conversion using general-registers, MMX-registers or memory as source0F 2A /rCVTPI2PS x,mm/m64 [g] NoNoCVTSI2SS x,r/m32
CVTSI2SS x,r/m64
[h]
YesYes [i] CVTPI2PD x,mm/m64 [g] NoNoCVTSI2SD x,r/m32
CVTSI2SD x,r/m64
[h]
YesYes [i] RC
Non-temporal store to memory from vector register.

The packed variants require aligned memory addresses even in VEX/EVEX-encoded forms.

0F 2B /rMOVNTPS m128,xYesYes [i] MOVNTSS m32,x
(AMD SSE4a)
NoNoMOVNTPD m128,xYesYes [i] MOVNTSD m64,x
(AMD SSE4a)
NoNoNo
Floating-point to integer conversion with truncation, using general-purpose registers or MMX-registers as destination0F 2C /rCVTTPS2PI mm,x/m64 [j] NoNoCVTTSS2SI r32,x/m32
CVTTSS2SI r64,x/m32 [k]
YesYes [i] CVTTPD2PI mm,x/m64 [j] NoNoCVTTSD2SI r32,x/m64
CVTTSD2SI r64,x/m64 [k]
YesYes [i] SAE
Floating-point to integer conversion, using general-purpose registers or MMX-registers as destination0F 2D /rCVTPS2PI mm,x/m64 [j] NoNoCVTSS2SI r32,x/m32
CVTSS2SI r64,x/m32 [k]
YesYes [i] CVTPD2PI mm,x/m64 [j] NoNoCVTSD2SI r32,x/m64
CVTSD2SI r64,x/m64 [k]
YesYes [i] RC
Unordered compare floating-point values and set EFLAGS.

Compares the bottom lanes of xmm vector registers.

0F 2E /rUCOMISS x,x/m32Yes [a] Yes [i] NoNoNoUCOMISD x,x/m64Yes [a] Yes [i] NoNoNoSAE
Compare floating-point values and set EFLAGS.

Compares the bottom lanes of xmm vector registers.

0F 2F /rCOMISS x,x/m32Yes [a] Yes [i] NoNoNoCOMISD x,x/m64Yes [a] Yes [i] NoNoNoSAE
Extract packed floating-point sign mask0F 50 /rMOVMSKPS r32,xYesNo [l] NoNoNoMOVMSKPD r32,xYesNo [l] NoNoNo
Floating-point Square Root0F 51 /rSQRTPS x,x/m128YesYesSQRTSS x,x/m32YesYesSQRTPD x,x/m128YesYesSQRTSD x,x/m64YesYesRC
Reciprocal Square Root Approximation [m] 0F 52 /rRSQRTPS x,x/m128YesNo [n] RSQRTSS x,x/m32YesNo [n] NoNoNo [n] NoNoNo [n]
Reciprocal Approximation [m] 0F 53 /rRCPPS x,x/m128YesNo [o] RCPSS x,x/m32YesNo [o] NoNoNo [o] NoNoNo [o]
Vector bitwise AND0F 54 /rANDPS x,x/m128Yes(DQ) [p] NoNoNoANDPD x,x/m128Yes(DQ) [p] NoNoNoNo
Vector bitwise AND-NOT0F 55 /rANDNPS x,x/m128Yes(DQ) [p] NoNoNoANDNPD x,x/m128Yes(DQ) [p] NoNoNoNo
Vector bitwise OR0F 56 /rORPS x,x/m128Yes(DQ) [p] NoNoNoORPD x,x/m128Yes(DQ) [p] NoNoNoNo
Vector bitwise XOR [q] 0F 57 /rXORPS x,x/m128Yes(DQ) [p] NoNoNoXORPD x,x/m128Yes(DQ) [p] NoNoNoNo
Floating-point Add0F 58 /rADDPS x,x/m128YesYesADDSS x,x/m32YesYesADDPD x,x/m128YesYesADDSD x,x/m64YesYesRC
Floating-point Multiply0F 59 /rMULPS x,x/m128YesYesMULSS x,x/m32YesYesMULPD x,x/m128YesYesMULSD x,x/m64YesYesRC
Convert between floating-point formats
(FP32→FP64, FP64→FP32)
0F 5A /rCVTPS2PD x,x/m64
(SSE2)
YesYes [r] CVTSS2SD x,x/m32
(SSE2)
YesYes [r] CVTPD2PS x,x/m128YesYes [r] CVTSD2SS x,x/m64YesYes [r] SAE,
RC [s]
Floating-point Subtract0F 5C /rSUBPS x,x/m128YesYesSUBSS x,x/m32YesYesSUBPD x,x/m128YesYesSUBSD x,x/m64YesYesRC
Floating-point Minimum Value [t] 0F 5D /rMINPS x,x/m128YesYesMINSS x,x/m32YesYesMINPD x,x/m128YesYesMINSD x,x/m64YesYesSAE
Floating-point Divide0F 5E /rDIVPS x,x/m128YesYesDIVSS x,x/m32YesYesDIVPD x,x/m128YesYesDIVSD x,x/m64YesYesRC
Floating-point Maximum Value [t] 0F 5F /rMAXPS x,x/m128YesYesMAXSS x,x/m32YesYesMAXPD x,x/m128YesYesMAXSD x,x/m64YesYesSAE
Floating-point compare. Result is written as all-0s/all-1s values (all-1s for comparison true) to vector registers for SSE/AVX, but opmask register for AVX-512. Comparison function is specified by imm8 argument. [u] 0F C2 /r ibCMPPS x,x/m128,imm8YesYesCMPSS x,x/m32,imm8YesYesCMPPD x,x/m128,imm8YesYesCMPSD x,x/m64,imm8
[c]
YesYesSAE
Packed Interleaved Shuffle.

Performs a shuffle on each of its two input arguments, then keeps the bottom half of the shuffle result from its first argument and the top half of the shuffle result from its second argument.

0F C6 /r ibSHUFPS x,x/m128,imm8 [f] YesYesNoNoNoSHUFPD x,x/m128,imm8 [f] YesYesNoNoNoNo
  1. 1 2 3 4 5 6 The VEX-prefix-encoded variants of the scalar instructions listed in this table should be encoded with VEX.L=0. Setting VEX.L=1 for any of these instructions is allowed but will result in what the Intel SDM describes as "unpredictable behavior across different processor generations". This also applies to VEX-encoded variants of V(U)COMISS and V(U)COMISD. (This behavior does not apply to scalar instructions outside this table, such as e.g. VMOVD/VMOVQ, where VEX.L=1 results in an #UD exception.)
  2. 1 2 3 4 5 6 7 8 EVEX-encoded variants of VMOVAPS, VMOVUPS, VMOVAPD and VMOVUPD support opmasks but do not support broadcast.
  3. 1 2 3 The SSE2 MOVSD (MOVe Scalar Double-precision) and CMPSD (CoMPare Scalar Double-precision) instructions have the same names as the older i386 MOVSD (MOVe String Doubleword) and CMPSD (CoMPare String Doubleword) instructions, however their operations are completely unrelated.

    At the assembly language level, they can be distinguished by their use of XMM register operands.

  4. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 For variants of VMOVLPS, VMOVHPS, VMOVLPD, VMOVHPD, VMOVLHPS, VMOVHLPS encoded with VEX or EVEX prefixes, the only supported vector length is 128 bits (VEX.L=0 or EVEX.L=0).

    For the EVEX-encoded variants, broadcasts and opmasks are not supported.

  5. 1 2 3 The MOVSLDUP, MOVSHDUP and MOVDDUP instructions are not regularly-encoded scalar SSE1/2 instructions, but instead irregularly-assigned SSE3 vector instructions. For a description of these instructions, see table below.
  6. 1 2 3 4 5 6 7 8 9 10 For the VUNPCK*, VSHUFPS and VSHUFPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction (except that for VSHUFPD, each 128-bit lane will use a different 2-bit part of the instruction's imm8 argument).
  7. 1 2 The CVTPI2PS and CVTPI2PD instructions take their input data as a vector of two 32-bit signed integers from either memory or MMX register. They will cause an x87→MMX transition even if the source operand is a memory operand.

    For vector int→FP conversions that can accept an xmm/ymm/zmm register or vectors wider than 64 bits as input arguments, SSE2 provides the following irregularly-assigned instructions (see table below):

    • CVTDQ2PS (0F 5B /r)
    • CVTDQ2PD (F3 0F E6 /r)
    These exist in AVX/AVX-512 extended forms as well.
  8. 1 2 For the (V)CVTSI2SS and (V)CVTSI2SD instructions, variants with a 64-bit source argument are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.

    In 32-bit mode, their source argument is always 32-bit even if VEX.W or EVEX.W is set to 1.

  9. 1 2 3 4 5 6 7 8 9 10 11 12 EVEX-encoded variants of
    • VMOVNTPS, VMOVNTPD
    • VCOMISS, VCOMISD, VUCOMISS, VUCOMISD
    • VCVTSI2SS, VCVTSI2SD
    • VCVT(T)SS2SI, VCVT(T)SD2SI
    support neither opmasks nor broadcast.
  10. 1 2 3 4 The CVT(T)PS2PI and CVT(T)PD2PI instructions write their result to MMX register as a vector of two 32-bit signed integers.

    For vector FP→int conversions that can write results to xmm/ymm/zmm registers, SSE2 provides the following irregularly-assigned instructions (see table below):

    • CVTPS2DQ (66 0F 5B /r)
    • CVTTPS2DQ (F3 0F 5B /r)
    • CVTPD2DQ (F2 0F E6 /r)
    • CVTTPD2DQ (66 0F E6 /r)
    These exist in AVX/AVX-512 extended forms as well.
  11. 1 2 3 4 For the (V)CVT(T)SS2SI and (V)CVT(T)SD2SI instructions, variants with a 64-bit destination register are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.

    In 32-bit mode, their destination register is always 32-bit even if VEX.W or EVEX.W is set to 1.

  12. 1 2 This instruction cannot be EVEX-encoded. Under AVX512DQ, extracting packed floating-point sign-bits can instead be done with the VPMOVD2M and VPMOVQ2M instructions.
  13. 1 2 The (V)RCPSS, (V)RCPPS, (V)RSQRTSS and (V)RSQRTPS approximation instructions compute their result with a relative error of at most 1.5 × 2^-12. The exact calculation is implementation-specific and known to vary between different x86 CPUs. [11]
  14. 1 2 3 4 This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4E/4F /r - for its new VRSQRT14* reciprocal square root approximation instructions.

    The main difference between the AVX-512 VRSQRT14* instructions and the older SSE/AVX (V)RSQRT* instructions is that the AVX-512 VRSQRT14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel. [12]

  15. 1 2 3 4 This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4C/4D /r - for its new VRCP14* reciprocal approximation instructions.

    The main difference between the AVX-512 VRCP14* instructions and the older SSE/AVX (V)RCP* instructions is that the AVX-512 VRCP14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel. [12]

  16. 1 2 3 4 5 6 7 8 The EVEX-encoded versions of the VANDPS, VANDPD, VANDNPS, VANDNPD, VORPS, VORPD, VXORPS, VXORPD instructions are not introduced as part of the AVX512F subset, but instead the AVX512DQ subset.
  17. XORPS/VXORPS with both source operands being the same register is commonly used as a register-zeroing idiom, and is recognized by most x86 CPUs as an instruction that does not depend on its source arguments.
    Under AVX or AVX-512, it is recommended to use a 128-bit form of VXORPS for this purpose - this will, on some CPUs, result in fewer micro-ops than wider forms while still achieving register-zeroing of the whole 256 or 512 bit vector-register. [13]
  18. 1 2 3 4 For EVEX-encoded variants of conversions between FP formats of different widths, the opmask lane width is determined by the result format: 64-bit for VCVTPS2PD and VCVTSS2SD, and 32-bit for VCVTPD2PS and VCVTSD2SS.
  19. Widening FP→FP conversions (CVTPS2PD, CVTSS2SD, VCVTPH2PD, VCVTSH2SD) support the SAE modifier. Narrowing conversions (CVTPD2PS, CVTSD2SS) support the RC modifier.
  20. 1 2 For the floating-point minimum-value and maximum-value instructions (V)MIN* and (V)MAX*, if the two input operands are both zero or at least one of the input operands is NaN, then the second input operand is returned. This matches the behavior of common C programming-language expressions such as ((op1)>(op2)?(op1):(op2)) for maximum-value and ((op1)<(op2)?(op1):(op2)) for minimum-value.
  21. For the SIMD floating-point compares, the imm8 argument has the following format:
    Bits | Usage
    1:0 | Basic comparison predicate
    2 | Invert comparison result
    3 | Invert comparison result if unordered (VEX/EVEX only)
    4 | Invert signalling behavior (VEX/EVEX only)
    The basic comparison predicates are:
    Value | Meaning
    00b | Equal (non-signalling)
    01b | Less-than (signalling)
    10b | Less-than-or-equal (signalling)
    11b | Unordered (non-signalling)
    A signalling compare will raise a floating-point invalid-operation exception if either input is NaN (including QNaN); a non-signalling compare raises it only for SNaN inputs.
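The imm8 encodings above are what the C compare intrinsics expand to; for example, _mm_cmplt_ps emits CMPPS with imm8 = 01b (less-than, signalling). The resulting all-0s/all-1s lane mask is commonly combined with the bitwise instructions from this table for branchless selection, as in this minimal sketch (function name is illustrative):

#include <immintrin.h>

/* Per-lane minimum built from a compare mask: where x < y keep x, else keep y.
 * SSE4.1's BLENDVPS (see a later table) performs the same select in one step. */
__m128 select_smaller(__m128 x, __m128 y)
{
    __m128 mask = _mm_cmplt_ps(x, y);            /* all-1s where x < y        */
    return _mm_or_ps(_mm_and_ps(mask, x),        /* take x in those lanes     */
                     _mm_andnot_ps(mask, y));    /* take y everywhere else    */
}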

Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof

These instructions do not have any MMX forms, and do not support any encodings without a prefix. Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:

Description | Instruction mnemonics | Basic opcode | SSE (66h prefix) | AVX (VEX.66 prefix) | AVX-512 (EVEX.66 prefix): supported, subset, lane, bcst
Added with SSE2
Unpack and interleave low-order 64-bit integers(V)PUNPCKLQDQ xmm,xmm/m128 [a] 0F 6C /rYesYesYes (W=1)F6464
Unpack and interleave high-order 64-bit integers(V)PUNPCKHQDQ xmm,xmm/m128 [a] 0F 6D /rYesYesYes (W=1)F6464
Right-shift 128-bit unsigned integer by specified number of bytes(V)PSRLDQ xmm,imm8 [a] 0F 73 /3 ibYesYesYesBWNoNo
Left-shift 128-bit integer by specified number of bytes(V)PSLLDQ xmm,imm8 [a] 0F 73 /7 ibYesYesYesBWNoNo
Move 64-bit scalar value from xmm register to xmm register or memory(V)MOVQ xmm/m64,xmm0F D6 /rYesYes (L=0)Yes
(L=0,W=1)
FNoNo
Added with SSE4.1
Variable blend packed bytes.

For each byte lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding byte lane of XMM0.

PBLENDVB xmm,xmm/m128
PBLENDVB xmm,xmm/m128,XMM0 [b]
0F38 10 /rYesNo [c] No [d]
Sign-extend packed integers into wider packed integers8-bit → 16-bit(V)PMOVSXBW xmm,xmm/m640F38 20 /rYesYesYesBW16No
8-bit → 32-bit(V)PMOVSXBD xmm,xmm/m320F38 21 /rYesYesYesF32No
8-bit → 64-bit(V)PMOVSXBQ xmm,xmm/m160F38 22 /rYesYesYesF64No
16-bit → 32-bit(V)PMOVSXWD xmm,xmm/m640F38 23 /rYesYesYesF32No
16-bit → 64-bit(V)PMOVSXWQ xmm,xmm/m320F38 24 /rYesYesYesF64No
32-bit → 64-bit(V)PMOVSXDQ xmm,xmm/m640F38 25 /rYesYesYes (W=0)F64No
Multiply packed 32-bit signed integers, store full 64-bit result.

The input integers are taken from the low 32 bits of each 64-bit vector lane.

(V)PMULDQ xmm,xmm/m1280F38 28 /rYesYesYes (W=1)F6464
Compare packed 64-bit integers for equality(V)PCMPEQQ xmm,xmm/m1280F38 29 /rYesYesYes (W=1) [e] F6464
Aligned non-temporal vector load from memory. [f] (V)MOVNTDQA xmm,m1280F38 2A /rYesYesYes (W=0)FNoNo
Pack 32-bit unsigned integers to 16-bit, with saturation(V)PACKUSDW xmm, xmm/m128 [a] 0F38 2B /rYesYesYes (W=0)BW1632
Zero-extend packed integers into wider packed integers8-bit → 16-bit(V)PMOVZXBW xmm,xmm/m640F38 30 /rYesYesYesBW16No
8-bit → 32-bit(V)PMOVZXBD xmm,xmm/m320F38 31 /rYesYesYesF32No
8-bit → 64-bit(V)PMOVZXBQ xmm,xmm/m160F38 32 /rYesYesYesF64No
16-bit → 32-bit(V)PMOVZXWD xmm,xmm/m640F38 33 /rYesYesYesF32No
16-bit → 64-bit(V)PMOVZXWQ xmm,xmm/m320F38 34 /rYesYesYesF64No
32-bit → 64-bit(V)PMOVZXDQ xmm,xmm/m640F38 35 /rYesYesYes (W=0)F64No
Packed minimum-value of signed integers8-bit(V)PMINSB xmm,xmm/m1280F38 38 /rYesYesYesBW8No
32-bit(V)PMINSD xmm,xmm/m1280F38 39 /rPMINSDVPMINSDVPMINSD(W0)F3232
64-bitVPMINSQ xmm,xmm/m128(AVX-512)VPMINSQ(W1)F6464
Packed minimum-value of unsigned integers16-bit(V)PMINUW xmm,xmm/m1280F38 3A /rYesYesYesBW16No
32-bit(V)PMINUD xmm,xmm/m128
0F38 3B /rPMINUDVPMINUDVPMINUD(W0)F3232
64-bitVPMINUQ xmm,xmm/m128(AVX-512)VPMINUQ(W1)F6464
Packed maximum-value of signed integers8-bit(V)PMAXSB xmm,xmm/m1280F38 3C /rYesYesYesBW8No
32-bit(V)PMAXSD xmm,xmm/m1280F38 3D /rPMAXSDVPMAXSDVPMAXSD(W0)F3232
64-bitVPMAXSQ xmm,xmm/m128(AVX-512)VPMAXSQ(W1)F6464
Packed maximum-value of unsigned integers16-bit(V)PMAXUW xmm,xmm/m1280F38 3E /rYesYesYesBW16No
32-bit(V)PMAXUD xmm,xmm/m128
0F38 3F /rPMAXUDVPMAXUDVPMAXUD(W0)F3232
64-bitVPMAXUQ xmm,xmm/m128(AVX-512)VPMAXUQ(W1)F6464
Multiply packed 32/64-bit integers, store low half of results(V)PMULLD xmm,xmm/m128
VPMULLQ xmm,xmm/m128(AVX-512)
0F38 40 /rPMULLDVPMULLDVPMULLD(W0)F3232
VPMULLQ(W1)DQ6464
Packed Horizontal Word Minimum

Find the smallest 16-bit integer in a packed vector of 16-bit unsigned integers, then return the integer and its index in the bottom two 16-bit lanes of the result vector.

(V)PHMINPOSUW xmm,xmm/m1280F38 41 /rYesYes (L=0)No
Blend Packed Words.

For each 16-bit lane of the result, pick a 16-bit value from either the first or the second source argument depending on the corresponding bit of the imm8.

(V)PBLENDW xmm,xmm/m128,imm8 [a] 0F3A 0E /r ibYesYes [g] No [h]
Extract integer from indexed lane of vector register, and store to GPR or memory.

Zero-extended if stored to GPR.

8-bit(V)PEXTRB r32/m8,xmm,imm8 [i] 0F3A 14 /r ibYesYes (L=0)Yes (L=0)BWNoNo
16-bit(V)PEXTRW r32/m16,xmm,imm8 [i] 0F3A 15 /r ibYesYes (L=0)Yes (L=0)BWNoNo
32-bit(V)PEXTRD r/m32,xmm,imm80F3A 16 /r ibYesYes
(L=0,W=0) [j]
Yes
(L=0,W=0)
DQNoNo
64-bit
(x86-64)
(V)PEXTRQ r/m64,xmm,imm8Yes
(REX.W)
Yes
(L=0,W=1)
Yes
(L=0,W=1)
DQNoNo
Insert integer from general-purpose register into indexed lane of vector register8-bit(V)PINSRB xmm,r32/m8,imm8 [k] 0F3A 20 /r ibYesYes (L=0)Yes (L=0)BWNoNo
32-bit(V)PINSRD xmm,r32/m32,imm80F3A 22 /r ibYesYes
(L=0,W=0) [j]
Yes
(L=0,W=0)
DQNoNo
64-bit
(x86-64)
(V)PINSRQ xmm,r64/m64,imm8Yes
(REX.W)
Yes
(L=0,W=1)
Yes
(L=0,W=1)
DQNoNo
Compute Multiple Packed Sums of Absolute Difference.

The 128-bit form of this instruction computes 8 sums of absolute differences from sequentially selected groups of four bytes in the first source argument and a selected group of four contiguous bytes in the second source operand, and writes the sums to sequential 16-bit lanes of destination register. If the two source arguments src1 and src2 are considered to be two 16-entry arrays of uint8 values and temp is considered to be an 8-entry array of uint16 values, then the operation of the instruction is:

for i = 0 to 7 do
    temp[i] := 0
    for j = 0 to 3 do
        a := src1[ i + (imm8[2]*4) + j ]
        b := src2[ (imm8[1:0]*4) + j ]
        temp[i] := temp[i] + abs(a-b)
    done
done
dst := temp

For wider forms of this instruction under AVX2 and AVX10.2, the operation is split into 128-bit lanes where each lane internally performs the same operation as the 128-bit variant of the instruction - except that odd-numbered lanes use bits 5:3 rather than bits 2:0 of the imm8.

(V)MPSADBW xmm,xmm/m128,imm80F3A 42 /r ibYesYesYes (W=0)10.2 [l] 16No
Added with SSE4.2
Compare packed 64-bit signed integers for greater-than(V)PCMPGTQ xmm, xmm/m1280F38 37 /rYesYesYes (W=1) [e] F6464
Packed Compare Explicit Length Strings, Return Mask(V)PCMPESTRM xmm,xmm/m128,imm80F3A 60 /r ibYes [m] Yes (L=0)No
Packed Compare Explicit Length Strings, Return Index(V)PCMPESTRI xmm,xmm/m128,imm80F3A 61 /r ibYes [m] Yes (L=0)No
Packed Compare Implicit Length Strings, Return Mask(V)PCMPISTRM xmm,xmm/m128,imm80F3A 62 /r ibYes [m] Yes (L=0)No
Packed Compare Implicit Length Strings, Return Index(V)PCMPISTRI xmm,xmm/m128,imm80F3A 63 /r ibYes [m] Yes (L=0)No
  1. 1 2 3 4 5 6 For the (V)PUNPCK*, (V)PACKUSDW, (V)PBLENDW, (V)PSLLDQ and (V)PSRLDQ instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  2. Assemblers may accept PBLENDVB with or without XMM0 as a third argument.
  3. The PBLENDVB instruction with opcode 66 0F38 10 /r is not VEX-encodable. AVX does provide a VPBLENDVB instruction that is similar to PBLENDVB, however, it uses a different opcode and operand encoding - VEX.66.0F3A.W0 4C /r /is4.
  4. Opcode not EVEX-encodable. Under AVX-512, variable blend of packed bytes may be done with the VPBLENDMB instruction (opcode EVEX.66.0F38.W0 66 /r).
  5. 1 2 The EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
  6. The load performed by (V)MOVNTDQA is weakly-ordered. It may be reordered with respect to other loads, stores and even LOCKs - to impose ordering with respect to other loads/stores, MFENCE or serialization is needed.

    If (V)MOVNTDQA is used with uncached memory, it may fetch a cache-line-sized block of data around the data actually requested - subsequent (V)MOVNTDQA instructions may return data from blocks fetched in this manner as long as they are not separated by an MFENCE or serialization.

  7. For AVX, the VBLENDPS instruction (and, with AVX2, the VPBLENDD instruction) can be used to perform a blend with 32-bit lanes, allowing one imm8 mask to span a full 256-bit vector without repetition.
  8. Opcode not EVEX-encodable. Under AVX-512, variable blend of packed words may be done with the VPBLENDMW instruction (opcode EVEX.66.0F38.W1 66 /r).
  9. 1 2 For (V)PEXTRB and (V)PEXTRW, if the destination argument is a register, then the extracted 8/16-bit value is zero-extended to 32/64 bits.
  10. 1 2 For the VPEXTRD and VPINSRD instructions in non-64-bit mode, the instructions are documented as being permitted to be encoded with VEX.W=1 on Intel [14] but not AMD [15] CPUs (although exceptions to this do exist, e.g. Bulldozer permits such encodings [16] while Sandy Bridge does not [17] )
    In 64-bit mode, these instructions require VEX.W=0 on both Intel and AMD processors — encodings with VEX.W=1 are interpreted as VPEXTRQ/VPINSRQ.
  11. In the case of a register source argument to (V)PINSRB, the argument is considered to be a 32-bit register of which the 8 bottom bits are used, not an 8-bit register proper. This means that it is not possible to specify AH/BH/CH/DH as a source argument to (V)PINSRB.
  12. EVEX-encoded variants of the VMPSADBW instruction are only available if AVX10.2 is supported.
  13. 1 2 3 4 The SSE4.2 packed string compare PCMP*STR* instructions allow their 16-byte memory operands to be misaligned even when using legacy SSE encoding.
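Notes 2 and 3 above describe how the legacy PBLENDVB takes its byte mask implicitly in XMM0 while the AVX replacement takes a fourth explicit register operand; with compiler intrinsics the mask is always an explicit argument and the compiler picks the encoding. A minimal sketch (function name is illustrative):

#include <immintrin.h>

/* SSE4.1 byte blend: for each byte, take it from b where the corresponding
 * mask byte has its top bit set, otherwise take it from a.                 */
__m128i blend_bytes(__m128i a, __m128i b, __m128i mask)
{
    return _mm_blendv_epi8(a, b, mask);
}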

Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof

SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated (L=0 or footnotes) these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.

Description | Instruction mnemonics | Basic opcode | SSE | AVX (VEX prefix) | AVX-512 (EVEX prefix): supported, subset, lane, bcst, rc/sae
Added with SSE
Load MXCSR (Media eXtension Control and Status Register) from memory(V)LDMXCSR m32NP 0F AE /2YesYes
(L=0)
No
Store MXCSR to memory(V)STMXCSR m32NP 0F AE /3YesYes
(L=0)
No
Added with SSE2
Move a 64-bit data item from MMX register to bottom half of XMM register. Top half is zeroed out.MOVQ2DQ xmm,mmF3 0F D6 /rYesNoNo
Move a 64-bit data item from bottom half of XMM register to MMX register.MOVDQ2Q mm,xmmF2 0F D6 /rYesNoNo
Load a 64-bit integer from memory or XMM register to bottom 64 bits of XMM register, with zero-fill(V)MOVQ xmm,xmm/m64F3 0F 7E /rYesYes (L=0)Yes (L=0,W=1)FNoNoNo
Vector load from unaligned memory or vector register(V)MOVDQU xmm,xmm/m128F3 0F 6F /rYesYesVMOVDQU64(W1)F64NoNo
VMOVDQU32(W0)F32NoNo
F2 0F 6F /rNoNoVMOVDQU16(W1)BW16NoNo
VMOVDQU8(W0)BW8NoNo
Vector store to unaligned memory or vector register(V)MOVDQU xmm/m128,xmmF3 0F 7F /rYesYesVMOVDQU64(W1)F64NoNo
VMOVDQU32(W0)F32NoNo
F2 0F 7F /rNoNoVMOVDQU16(W1)BW16NoNo
VMOVDQU8(W0)BW8NoNo
Shuffle the four top 16-bit lanes of source vector, then place result in top half of destination vector(V)PSHUFHW xmm,xmm/m128,imm8 [a] F3 0F 70 /r ibYesYes [b] YesBW16NoNo
Shuffle the four bottom 16-bit lanes of source vector, then place result in bottom half of destination vector(V)PSHUFLW xmm,xmm/m128,imm8 [a] F2 0F 70 /r ibYesYes [b] YesBW16NoNo
Convert packed signed 32-bit integers to FP32(V)CVTDQ2PS xmm,xmm/m128NP 0F 5B /rYesYesYes (W=0)F3232RC
Convert packed FP32 values to packed signed 32-bit integers(V)CVTPS2DQ xmm,xmm/m12866 0F 5B /rYesYesYes (W=0)F3232RC
Convert packed FP32 values to packed signed 32-bit integers, with round-to-zero(V)CVTTPS2DQ xmm,xmm/m128F3 0F 5B /rYesYesYes (W=0)F3232SAE
Convert packed FP64 values to packed signed 32-bit integers, with round-to-zero(V)CVTTPD2DQ xmm,xmm/m12866 0F E6 /rYesYesYes (W=1)F3264SAE
Convert packed signed 32-bit integers to FP64(V)CVTDQ2PD xmm,xmm/m64F3 0F E6 /rYesYesYes (W=0)F6432RC [c]
Convert packed FP64 values to packed signed 32-bit integers(V)CVTPD2DQ xmm,xmm/m128F2 0F E6 /rYesYesYes (W=1)F3264RC
Added with SSE3
Duplicate floating-point values from even-numbered lanes to next odd-numbered lanes up32-bit(V)MOVSLDUP xmm,xmm/m128F3 0F 12 /rYesYesYes (W=0)F32NoNo
64-bit(V)MOVDDUP xmm,xmm/m64F2 0F 12 /rYesYesYes (W=1)F64NoNo
Duplicate FP32 values from odd-numbered lanes to next even-numbered lanes down(V)MOVSHDUP xmm,xmm/m128F3 0F 16 /rYesYesYes (W=0)F32NoNo
Packed pairwise horizontal addition of floating-point values32-bit(V)HADDPS xmm,xmm/m128 [a] F2 0F 7C /rYesYesNo
64-bit(V)HADDPD xmm,xmm/m128 [a] 66 0F 7C /rYesYesNo
Packed pairwise horizontal subtraction of floating-point values32-bit(V)HSUBPS xmm,xmm/m128 [a] F2 0F 7D /rYesYesNo
64-bit(V)HSUBPD xmm,xmm/m128 [a] 66 0F 7D /rYesYesNo
Packed floating-point add/subtract in alternating lanes. Even-numbered lanes (counting from 0) do subtract, odd-numbered lanes do add.32-bit(V)ADDSUBPS xmm,xmm/m128F2 0F D0 /rYesYesNo
64-bit(V)ADDSUBPD xmm,xmm/m12866 0F D0 /rYesYesNo
Vector load from unaligned memory with looser semantics than (V)MOVDQU.

Unlike (V)MOVDQU, it may fetch data more than once or, for a misaligned access, fetch additional data up until the next 16/32-byte alignment boundaries below/above the actually-requested data.

(V)LDDQU xmm,m128F2 0F F0 /rYesYesNo
Added with SSE4.1
Vector logical test.

Sets ZF=1 if bitwise-AND between first operand and second operand results in all-0s, ZF=0 otherwise. Sets CF=1 if bitwise-AND between second operand and bitwise-NOT of first operand results in all-0s, CF=0 otherwise

(V)PTEST xmm,xmm/m12866 0F38 17 /rYesYesNo [d]
Variable blend packed floating-point values.

For each lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding lane of XMM0.

32-bitBLENDVPS xmm,xmm/m128
BLENDVPS xmm,xmm/m128,XMM0 [e]
66 0F38 14 /rYesNo [f] No
64-bitBLENDVPD xmm,xmm/m128
BLENDVPD xmm,xmm/m128,XMM0 [e]
66 0F38 15 /rYesNo [f] No
Rounding of packed floating-point values to integer.

Rounding mode specified by imm8 argument.

32-bit(V)ROUNDPS xmm,xmm/m128,imm866 0F3A 08 /r ibYesYesNo [g]
64-bit(V)ROUNDPD xmm,xmm/m128,imm866 0F3A 09 /r ibYesYesNo [g]
Rounding of scalar floating-point value to integer.32-bit(V)ROUNDSS xmm,xmm/m128,imm866 0F3A 0A /r ibYesYesNo [g]
64-bit(V)ROUNDSD xmm,xmm/m128,imm866 0F3A 0B /r ibYesYesNo [g]
Blend packed floating-point values. For each lane of the result, pick the value from either the first or the second argument depending on the corresponding imm8 bit.32-bit(V)BLENDPS xmm,xmm/m128,imm866 0F3A 0C /r ibYesYesNo
64-bit(V)BLENDPD xmm,xmm/m128,imm866 0F3A 0D /r ibYesYesNo
Extract 32-bit lane of XMM register to general-purpose register or memory location.

Bits[1:0] of imm8 is used to select lane.

(V)EXTRACTPS r/m32,xmm,imm866 0F3A 17 /r ibYesYes (L=0)Yes (L=0)FNoNoNo
Obtain 32-bit value from source XMM register or memory, and insert into the specified lane of destination XMM register.

If the source argument is an XMM register, then bits[7:6] of the imm8 is used to select which 32-bit lane to select source from, otherwise the specified 32-bit memory value is used. This 32-bit value is then inserted into the destination register lane specified by bits[5:4] of the imm8. After insertion, each 32-bit lane of the destination register may optionally be zeroed out - bits[3:0] of the imm8 provides a bitmap of which lanes to zero out.

(V)INSERTPS xmm,xmm/m32,imm866 0F3A 21 /r ibYesYes (L=0)Yes (L=0,W=0)FNoNoNo
4-component dot-product of 32-bit floating-point values.

Bits [7:4] of the imm8 specify which lanes should participate in the dot-product, bits[3:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)

(V)DPPS xmm,xmm/m128,imm8 [a] 66 0F3A 40 /r ibYesYesNo
2-component dot-product of 64-bit floating-point values.

Bits [5:4] of the imm8 specify which lanes should participate in the dot-product, bits[1:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)

(V)DPPD xmm,xmm/m128,imm8 [a] 66 0F3A 41 /r ibYesYesNo
Added with SSE4a (AMD only)
64-bit bitfield insert, using the low 64 bits of XMM registers.

First argument is an XMM register to insert bitfield into, second argument is a source register containing the bitfield to insert (starting from bit 0).

For the 4-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bit-offset to insert bitfield at. For the 2-argument version, the length and offset are instead taken from bits [69:64] and [77:72] of the second argument, respectively.

INSERTQ xmm,xmm,imm8,imm8F2 0F 78 /r ib ibYesNoNo [h]
INSERTQ xmm,xmmF2 0F 79 /rYesNoNo [h]
64-bit bitfield extract, from the lower 64 bits of an XMM register.

The first argument serves as both source that bitfield is extracted from and destination that bitfield is written to.

For the 3-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bitfield bit-offset. For the 2-argument version, the second argument is an XMM register that contains bitfield length at bits[5:0] and bit-offset at bits[13:8].

EXTRQ xmm,imm8,imm866 0F 78 /0 ib ibYesNoNo [h]
EXTRQ xmm,xmm66 0F 79 /rYesNoNo [h]
  1. 1 2 3 4 5 6 7 8 For the VPSHUFLW, VPSHUFHW, VHADDP*, VHSUBP*, VDPPS and VDPPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  2. 1 2 Under AVX, the VPSHUFHW and VPSHUFLW instructions are only available in 128-bit forms - the 256-bit forms of these instructions require AVX2.
  3. For the EVEX-encoded form of VCVTDQ2PD, EVEX embedded rounding controls are permitted but have no effect.
  4. Opcode not EVEX-encodable. Performing a vector logical test under AVX-512 requires a sequence of at least 2 instructions, e.g. VPTESTMD followed by KORTESTW.
  5. 1 2 Assemblers may accept the BLENDVPS/BLENDVPD instructions with or without XMM0 as a third argument.
  6. 1 2 While AVX does provide VBLENDVPS/VPD instructions that are similar in function to BLENDVPS/VPD, they use a different opcode and operand encoding - VEX.66.0F3A.W0 4A/4B /r /is4.
  7. 1 2 3 4 Opcode not available under AVX-512. Instead, AVX512F provides different opcodes - EVEX.66.0F3A (08..0B) /r ib - for its new VRNDSCALE* rounding instructions.
  7. 1 2 3 4 Under AVX-512, EVEX-encoding the INSERTQ/EXTRQ opcodes results in AVX-512 instructions completely unrelated to SSE4a, namely VCVT(T)P(S|D)2UQQ and VCVT(T)S(S|D)2USI.
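The (V)PTEST entry above defines its ZF and CF outputs as two bitwise-AND tests; the SSE4.1 intrinsics _mm_testz_si128 and _mm_testc_si128 return exactly those two flags. A minimal sketch (function names are illustrative):

#include <immintrin.h>

/* ZF test: returns 1 if and only if every bit of v is zero. */
int vector_is_zero(__m128i v)
{
    return _mm_testz_si128(v, v);          /* ZF = ((v AND v) == 0)          */
}

/* CF test: returns 1 if and only if v has every bit of mask set. */
int covers_mask(__m128i v, __m128i mask)
{
    return _mm_testc_si128(v, mask);       /* CF = (((NOT v) AND mask) == 0) */
}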


AVX/AVX2 instructions, and AVX-512 extended variants thereof

This covers instructions/opcodes that are new to AVX and AVX2.

AVX and AVX2 also include extended VEX-encoded forms of a large number of MMX/SSE instructions; see the tables above.

Some of the AVX/AVX2 instructions also exist in extended EVEX-encoded forms under AVX-512.

AVX instructions

Instruction description | Instruction mnemonics | Basic opcode (VEX) | AVX | AVX-512 (EVEX-encoded): supported, subset, lane, bcst
Zero out upper bits of YMM/ZMM registers. [a]

Zeroes out all bits except bits 127:0 of ymm0 to ymm15.

VZEROUPPERVEX.NP.0F 77 [b] (L=0)No
Zero out YMM/ZMM registers. [a]

Zeroes out registers ymm0 to ymm15.

VZEROALL(L=1)No
Broadcast floating-point data from memory or bottom of XMM-register to all lanes of XMM/YMM/ZMM-register.32-bitVBROADCASTSS ymm,xmm/m32 [c] VEX.66.0F38.W0 18 /rYesYesF32(32) [d]
64-bitVBROADCASTSD ymm,xmm/m64 [c]
VBROADCASTF32X2 zmm,xmm/m64(AVX-512)
VEX.66.0F38 19 /rVBROADCASTSD
(L=1 [e] ,W=0)
VBROADCASTF32X2(L≠0,W=0)DQ32(64) [d]
VBROADCASTSD(L≠0,W=1)F64(64) [d]
128-bitVBROADCASTF128 ymm,m128
VBROADCASTF32X4 zmm,m128(AVX-512)
VBROADCASTF64X2 zmm,m128(AVX-512)
VEX.66.0F38 1A /rVBROADCASTF128
(L=1,W=0)
VBROADCASTF32X4(L≠0,W=0)F32(128) [d]
VBROADCASTF64X2(L≠0,W=1)DQ64(128) [d]
Extract 128-bit vector-lane of floating-point data from wider vector-registerVEXTRACTF128 xmm/m128,ymm,imm8
VEXTRACTF32X4 xmm/m128,zmm,imm8(AVX-512)
VEXTRACTF64X2 xmm/m128,zmm,imm8(AVX-512)
VEX.66.0F3A 19 /r ibVEXTRACTF128
(L=1,W=0)
VEXTRACTF32X4(L≠0,W=0)F32No
VEXTRACTF64X2(L≠0,W=1)DQ64No
Insert 128-bit vector of floating-point data into 128-bit lane of wider vectorVINSERTF128 ymm,ymm,xmm/m128,imm8
VINSERTF32X4 zmm,zmm,xmm/m128,imm8(AVX-512)
VINSERTF64X2 zmm,zmm,xmm/m128,imm8(AVX-512)
VEX.66.0F3A 18 /r ibVINSERTF128
(L=1,W=0)
VINSERTF32X4(L≠0,W=0)F32No
VINSERTF64X2(L≠0,W=1)DQ64No
Concatenate the two source vectors into a vector of four 128-bit components, then use imm8 to index into vector
  • Bits [1:0] of imm8 picks element to use for low 128 bits of result
  • Bits[3:2] of imm8 picks element to use for high 128 bits of result
VPERM2F128 ymm,ymm,ymm/m256,imm8VEX.66.0F3A.W0 06 /r /ib(L=1)No
Perform shuffle of 32-bit sub-lanes within each 128-bit lane of vectors.

Variable-shuffle form uses bits[1:0] of each lane for selection.
imm8 form uses same shuffle in every 128-bit lane.

VPERMILPS ymm,ymm,ymm/m256VEX.66.0F38.W0 0C /rYesYesF3232
VPERMILPS ymm,ymm/m256,imm8VEX.66.0F3A.W0 04 /r ibYesYesF3232
Perform shuffle of 64-bit sub-lanes within each 128-bit lane of vectors.

Variable-shuffle form uses bit[1] of each lane for selection.
imm8 form uses two bits of the imm8 for each of the 128-bit lanes.

VPERMILPD ymm,ymm,ymm/m256VEX.66.0F38.W0 0D /rYesYesF6464
VPERMILPD ymm,ymm/m256,imm8VEX.66.0F3A.W0 05 /r ibYesYesF6464
Packed memory load/store of floating-point data with per-lane write masking.

First argument is destination, third argument is source. The second argument provides masks, in the top bit of each 32-bit lane.

32-bitVMASKMOVPS ymm,ymm,m256VEX.66.0F38.W0 2C /rYesNo [f]
VMASKMOVPS m256,ymm,ymmVEX.66.0F38.W0 2E /rYesNo [f]
64-bitVMASKMOVPD ymm,ymm,m256VEX.66.0F38.W0 2D /rYesNo [f]
VMASKMOVPD m256,ymm,ymmVEX.66.0F38.W0 2F /rYesNo [f]
Variable blend packed floating-point values.

For each lane of the result, pick the value from either the second or the third argument depending on the top bit of the corresponding lane of the fourth argument.

32-bitVBLENDVPS ymm,ymm,ymm/m256,ymmVEX.66.0F3A.W0 4A /r /is4YesNo
64-bitVBLENDVPD ymm,ymm,ymm/m256,ymmVEX.66.0F3A.W0 4B /r /is4YesNo
Variable blend packed bytes.

For each byte lane of the result, pick the value from either the second or the third argument depending on the top bit of the corresponding byte lane of the fourth argument.

VPBLENDVB xmm,xmm,xmm/m128,xmm [g] VEX.66.0F3A.W0 4C /r is4YesNo
Vector logical sign-bit test on packed floating-point values.

Sets ZF=1 if bitwise-AND between sign-bits of the first operand and second operand results in all-0s, ZF=0 otherwise. Sets CF=1 if bitwise-AND between sign-bits of second operand and bitwise-NOT of first operand results in all-0s, CF=0 otherwise.

32-bitVTESTPS ymm,ymm/m256VEX.66.0F38.W0 0E /rYesNo
64-bitVTESTPD ymm,ymm/m256VEX.66.0F38.W0 0F /rYesNo
  1. 1 2 For code that may potentially mix use of legacy-SSE instructions with 256-bit AVX instructions, it is strongly recommended to execute a VZEROUPPER or VZEROALL instruction after executing AVX instructions but before executing SSE instructions. If this is not done, any subsequent legacy-SSE code may be subject to severe performance degradation. [18]
  2. While the VZEROUPPER and VZEROALL instructions are architecturally listed as ignoring the VEX.W bit, some early AVX implementations (e.g. Sandy Bridge [19] ) will #UD if the VZEROUPPER and VZEROALL instructions are encoded with VEX.W=1. For this reason, it is recommended to encode these instructions with VEX.W=0.
  3. 1 2 VBROADCASTSS and VBROADCASTSD with a register source operand are not supported under AVX - support for xmm-register source operands for these instructions was added in AVX2.
  4. 1 2 3 4 5 The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not require or accept the EVEX.b modifier.
  5. The VBROADCASTSD instruction does not support broadcast of 64-bit data into a 128-bit vector. For broadcast of 64-bit data into a 128-bit vector, the SSE3 (V)MOVDDUP instruction or the AVX2 VPBROADCASTQ instruction may be used.
  6. 1 2 3 4 Under AVX-512, EVEX-encoded forms of the VMASKMOVP(S|D) instructions are not available. For masked moves of FP32/FP64 values to/from memory under AVX-512, the VMOVUPS and VMOVUPD may be used with an opmask register.
  7. Under AVX, the VPBLENDVB instruction is only available with a 128-bit vector width (VEX.L=0). Support for 256-bit vector width was added in AVX2.
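The masked-move and variable-blend instructions above are commonly reached from C through the AVX intrinsics. The following is a minimal sketch of VMASKMOVPS and VBLENDVPS usage, assuming a compiler providing <immintrin.h> and AVX code generation (e.g. -mavx); the pointers and the mask pattern are illustrative only:

    #include <immintrin.h>

    /* Load 8 floats, but only in the lanes whose mask sign bit is set
       (VMASKMOVPS load form); merge with a fallback value using VBLENDVPS,
       then store back through the same mask (VMASKMOVPS store form). */
    void masked_load_blend(const float *src, float *dst)
    {
        /* Lanes with bit 31 of the mask set participate; others read as 0. */
        __m256i mask = _mm256_setr_epi32(-1, 0, -1, 0, -1, 0, -1, 0);

        __m256 loaded   = _mm256_maskload_ps(src, mask);   /* VMASKMOVPS (load)  */
        __m256 fallback = _mm256_set1_ps(1.0f);

        /* Per lane: take 'loaded' where the mask sign bit is set, else 'fallback'. */
        __m256 result = _mm256_blendv_ps(fallback, loaded,
                                         _mm256_castsi256_ps(mask));   /* VBLENDVPS */

        _mm256_maskstore_ps(dst, mask, result);            /* VMASKMOVPS (store) */
    }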

AVX2 instructions

Instruction descriptionInstruction mnemonicsBasic opcode (VEX)AVX2AVX-512 (EVEX-encoded)
supportedsubsetlanebcst
Broadcast integer data from memory or bottom lane of XMM-register to all lanes of XMM/YMM/ZMM register8-bitVPBROADCASTB ymm,xmm/m8VEX.66.0F38.W0 78 /rYesYes [a] BW8(8) [b]
16-bitVPBROADCASTW ymm,xmm/m16VEX.66.0F38.W0 79 /rYesYes [a] BW16(16) [b]
32-bitVPBROADCASTD ymm,xmm/m32VEX.66.0F38.W0 58 /rYesYes [a] F32(32) [b]
64-bitVPBROADCASTQ ymm,xmm/m64
VBROADCASTI32X2 zmm,xmm/m64(AVX-512)
VEX.66.0F38 59 /rVPBROADCASTQ
(W=0)
VBROADCASTI32X2(W=0)DQ32(64) [b]
VPBROADCASTQ(W=1) [a] F64(64) [b]
128-bitVBROADCASTI128 ymm,m128
VBROADCASTI32X4 zmm,m128(AVX-512)
VBROADCASTI64X2 zmm,m128(AVX-512)
VEX.66.0F38 5A /rVBROADCASTI128
(L=1,W=0)
VBROADCASTI32X4(L≠0,W=0)F32(128) [b]
VBROADCASTI64X2(L≠0,W=1)DQ64(128) [b]
Extract 128-bit vector-lane of integer data from wider vector-registerVEXTRACTI128 xmm/m128,ymm,imm8
VEXTRACTI32X4 xmm/m128,zmm,imm8(AVX-512)
VEXTRACTI64X2 xmm/m128,zmm,imm8(AVX-512)
VEX.66.0F3A 39 /r ibVEXTRACTI128
(L=1,W=0)
VEXTRACTI32X4(L≠0,W=0)F32No
VEXTRACTI64X2(L≠0,W=1)DQ64No
Insert 128-bit vector of integer data into lane of wider vectorVINSERTI128 ymm,ymm,xmm/m128,imm8
VINSERTI32X4 zmm,zmm,xmm/m128,imm8(AVX-512)
VINSERTI64X2 zmm,zmm,xmm/m128,imm8(AVX-512)
VEX.66.0F3A 38 /r ibVINSERTI128
(L=1,W=0)
VINSERTI32X4(L≠0,W=0)F32No
VINSERTI64X2(L≠0,W=1)DQ64No
Concatenate the two source vectors into a vector of four 128-bit components, then use imm8 to index into vector
  • Bits [1:0] of imm8 pick the element to use for the low 128 bits of the result
  • Bits [3:2] of imm8 pick the element to use for the high 128 bits of the result
VPERM2I128 ymm,ymm,ymm/m256,imm8VEX.66.0F3A.W0 46 /r ib(L=1)No
Perform shuffle of FP64 values in vectorVPERMPD ymm,ymm/m256,imm8VEX.66.0F3A.W1 01 /r ib(L=1) [c] Yes (L≠0) [d] F6464
Perform shuffle of 64-bit integers in vectorVPERMQ ymm,ymm/m256,imm8VEX.66.0F3A.W1 00 /r ib(L=1) [c] Yes (L≠0) [d] F6464
Perform variable shuffle of FP32 values in vectorVPERMPS ymm,ymm,ymm/m256VEX.66.0F38.W0 16 /r(L=1) [c] Yes (L≠0)F3232
Perform variable shuffle of 32-bit integers in vectorVPERMD ymm,ymm,ymm/m256VEX.66.0F38.W0 36 /r(L=1) [c] Yes (L≠0)F3232
Packed memory load/store of integer data with per-lane write masking.

First argument is destination, third argument is source. The second argument provides masks, in the top bit of each lane.

32-bitVPMASKMOVD ymm,ymm,m256VEX.66.0F38.W0 8C /rYesNo
VPMASKMOVD m256,ymm,ymmVEX.66.0F38.W0 8E /rYesNo
64-bitVPMASKMOVQ ymm,ymm,m256VEX.66.0F38.W1 8C /rYesNo
VPMASKMOVQ m256,ymm,ymmVEX.66.0F38.W1 8E /rYesNo
Blend packed 32-bit integer values.

For each 32-bit lane of result, pick value from second or third argument depending on the corresponding bit in the imm8 argument.

VPBLENDD ymm,ymm,ymm/m256,imm8VEX.66.0F3A.W0 02 /r ibYesNo
Left-shift packed integers, with per-lane shift-amount32-bitVPSLLVD ymm,ymm,ymm/m256VEX.66.0F38.W0 47 /rYesYesF3232
64-bitVPSLLVQ ymm,ymm,ymm/m256VEX.66.0F38.W1 47 /rYesYesF6464
Right-shift packed signed integers, with per-lane shift-amount32-bitVPSRAVD ymm,ymm,ymm/m256VEX.66.0F38 46 /rVPSRAVD
(W=0)
VPSRAVD(W=0)F3232
64-bitVPSRAVQ zmm,zmm,zmm/m512(AVX-512)VPSRAVQ(W=1)F6464
Right-shift packed unsigned integers, with per-lane shift-amount32-bitVPSRLVD ymm,ymm,ymm/m256VEX.66.0F38.W0 45 /rYesYesF3232
64-bitVPSRLVQ ymm,ymm,ymm/m256VEX.66.0F38.W1 45 /rYesYesF6464
Conditional vector memory gather.

For each 32/64-bit component of a given input vector register, treat the component as an index for an x86 SIB base+scale*index+displacement address calculation, then load a 32/64-bit data item from the computed memory address.

The third argument to the instruction is a mask argument - for each destination vector lane, a memory load is only performed if the MSB of the corresponding mask-argument lane is set to 1. For each load, the corresponding mask-argument lane is zeroed out. [e] (A C-intrinsics sketch of a masked gather is given after the notes below.)

s32→i32VPGATHERDD ymm1,vm32y,ymm2VEX.66.0F38.W0 90 /r /vsibYesYes [e] F32No
s32→i64VPGATHERDQ ymm1,vm32x,ymm2VEX.66.0F38.W1 90 /r /vsibYesYes [e] F64No
s64→i32VPGATHERQD xmm1,vm64y,xmm2VEX.66.0F38.W0 91 /r /vsibYesYes [e] F32No
s64→i64VPGATHERQQ ymm1,vm64y,ymm2VEX.66.0F38.W1 91 /r /vsibYesYes [e] F64No
s32→fp32VGATHERDPS ymm1,vm32y,ymm2VEX.66.0F38.W0 92 /r /vsibYesYes [e] F32No
s32→fp64VGATHERDPD ymm1,vm32x,ymm2VEX.66.0F38.W1 92 /r /vsibYesYes [e] F64No
s64→fp32VGATHERQPS xmm1,vm64y,xmm2VEX.66.0F38.W0 93 /r /vsibYesYes [e] F32No
s64→fp64VGATHERQPD ymm1,vm64y,ymm2VEX.66.0F38.W1 93 /r /vsibYesYes [e] F64No
  1. 1 2 3 4 For AVX-512, variants of the VPBROADCAST(B/W/D/Q) instructions that can use a general-purpose register as source exist as well, with opcodes EVEX.66.0F38.W0 (7A..7C)
  2. 1 2 3 4 5 6 7 The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not require or accept the EVEX.b modifier.
  3. 1 2 3 4 For VPERMPS, VPERMPD, VPERMD and VPERMQ, minimum supported vector width is 256 bits. For shuffles in a 128-bit vector, use VPERMILPS or VPERMILPD.
  4. 1 2 Under AVX-512, executing the VPERMPD and VPERMQ instructions with a vector width of 512 bits will cause the operation to be split into two 256-bit halves, with the imm8 swizzle being applied to each half separately.
    Under AVX-512, variable-shuffle variants of the VPERMPD and VPERMQ instructions exist with opcodes EVEX.66.0F38.W1 16 /r and EVEX.66.0F38.W1 36 /r, respectively - these variants do not split their operation into 256-bit halves.
  5. 1 2 3 4 5 6 7 8 9 For EVEX-encoded forms of the V(P)GATHER* instructions under AVX-512, lane-masking is done with an opmask register instead of an XMM/YMM/ZMM vector register.
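The conditional gather instructions above are usually reached from C through the AVX2 intrinsics. The following is a minimal sketch of a masked VGATHERDPS, assuming a compiler providing <immintrin.h> and AVX2 code generation (e.g. -mavx2); the index pattern, mask and scale factor are illustrative only:

    #include <immintrin.h>

    /* Gather eight FP32 values from base[index[i]], but only in the lanes
       whose mask sign bit is set; masked-off lanes keep the value of 'src'.
       The scale argument must be a compile-time constant 1, 2, 4 or 8. */
    __m256 gather_selected(const float *base)
    {
        __m256i index = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14);
        __m256  src   = _mm256_setzero_ps();    /* fallback for masked-off lanes */
        __m256  mask  = _mm256_castsi256_ps(
                            _mm256_setr_epi32(-1, -1, -1, -1, 0, 0, 0, 0));

        /* VGATHERDPS ymm, [base + 4*index], ymm_mask */
        return _mm256_mask_i32gather_ps(src, base, index, mask, 4);
    }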

Other VEX-encoded SIMD instructions

SIMD instruction set extensions that use the VEX prefix and are not considered part of baseline AVX/AVX2/AVX-512, FMA3/FMA4 or AMX.

Integer, opmask and cryptographic instructions that use the VEX prefix (e.g. the BMI2, CMPccXADD, VAES and SHA512 extensions) are not included.

Instruction set extensionInstruction descriptionInstruction mnemonicsBasic opcode (VEX)AVX-512 (EVEX-encoded)Added in
supp.subsetlanebcstrc/sae
F16C
Packed conversions between FP16 and FP32
Packed conversion from FP16 to FP32.VCVTPH2PS ymm1,xmm2/m128VEX.66.0F38.W0 13 /rYesF3216SAE Ivy Bridge,
Piledriver,
Jaguar,
Nano QuadCore C4000,
ZhangJiang
Packed conversion from FP32 to FP16.

Imm8 argument provides rounding controls. [a]

VCVTPS2PH xmm1,ymm2/m256,imm8VEX.66.0F3A.W0 1D /r ibYes F16NoSAE
AVX-VNNI
Vector Neural Network Instructions
For each 32-bit lane, compute an integer dot-product of 8-bit components from the two source arguments (first unsigned, second signed), then add that dot-product result to an accumulator.no saturationVPDPBUSD ymm1,ymm2,ymm3/m256VEX.66.0F38.W0 50 /rYesVNNI3232No
AVX512_VNNI:
Cascade Lake,
Zen 4
AVX-VNNI:
Alder Lake,
Sapphire Rapids,
Zen 5
signed saturationVPDPBUSDS ymm1,ymm2,ymm3/m256VEX.66.0F38.W0 51 /rYesVNNI3232No
For each 32-bit lane, compute an integer dot-product of 16-bit components from the two source arguments (both signed), then add the dot-product result to an accumulator.no saturationVPDPWSSD ymm1,ymm2,ymm3/m256VEX.66.0F38.W0 52 /rYesVNNI3232No
signed saturationVPDPWSSDS ymm1,ymm2,ymm3/m256VEX.66.0F38.W0 53 /rYesVNNI3232No
AVX-IFMA
Integer Fused Multiply Add
For each 64-bit lane, perform an unsigned multiply of the bottom 52 bits of each of the two source arguments, then extract either the low half or the high half of the 104-bit product as an unsigned 52-bit integer that is then added to the corresponding 64-bit lane in the destination register.low halfVPMADD52LUQ ymm1,ymm2,ymm3/m256VEX.66.0F38.W1 B4 /rYesIFMA6464No
AVX512_IFMA:
Cannon Lake,
Ice Lake,
Zen 4
AVX-IFMA:
Lunar Lake,
Arrow Lake
high halfVPMADD52HUQ ymm1,ymm2,ymm3/m256VEX.66.0F38.W1 B5 /rYesIFMA6464No
 
AVX-NE-CONVERT
No-exception FP16/BF16 conversion instructions
Convert packed FP32 to packed BF16 with round-to-nearest-evenVCVTNEPS2BF16 xmm1,ymm2/m256VEX.F3.0F38.W0 72 /r YesBF16 [b] 16 32No
AVX512_BF16:
Cooper Lake,
Zen 4,
Sapphire Rapids
AVX-NE-CONVERT:
Lunar Lake,
Arrow Lake
Load a vector of packed FP16 or BF16 values from memory, then convert all the even or odd elements in that vector (depending on instruction) to packed FP32 values.BF16, evenVCVTNEEBF162PS ymm,m256VEX.F3.0F38.W0 B0 /rNo
FP16, even VCVTNEEPH2PS ymm,m256VEX.66.0F38.W0 B0 /rNo
BF16, oddVCVTNEOBF162PS ymm,m256VEX.F2.0F38.W0 B0 /rNo
FP16, oddVCVTNEOPH2PS ymm,m256VEX.NP.0F38.W0 B0 /rNo
Load scalar FP16 or BF16 value from memory, convert to FP32, then broadcast to destination vector register.BF16 VBCSTNEBF162PS ymm,m16VEX.F3.0F38.W0 B1 /rNo
FP16VBCSTNESH2PS ymm,m16VEX.66.0F38.W0 B1 /rNo
AVX-VNNI-INT8
For each 32-bit lane, compute an integer dot-product of four 8-bit components from the two source arguments, then add the dot-product result to an accumulator. Each of the two source arguments may have their components treated as either signed or unsigned; the addition to the accumulator may be done with or without saturation (signed or unsigned) depending on instruction.s8*s8VPDPBSSD ymm1,ymm2,ymm3/m256VEX.F2.0F38.W0 50 /rNo Lunar Lake,
Arrow Lake
s8*s8, ssatVPDPBSSDS ymm1,ymm2,ymm3/m256VEX.F2.0F38.W0 51 /rNo
s8*u8VPDPBSUD ymm1,ymm2,ymm3/m256VEX.F3.0F38.W0 50 /rNo
s8*u8, ssatVPDPBSUDS ymm1,ymm2,ymm3/m256VEX.F3.0F38.W0 51 /rNo
u8*u8VPDPBUUD ymm1,ymm2,ymm3/m256VEX.NP.0F38.W0 50 /rNo
u8*u8, usatVPDPBUUDS ymm1,ymm2,ymm3/m256VEX.NP.0F38.W0 51 /rNo
AVX-VNNI-INT16
For each 32-bit lane, compute an integer dot-product of two 16-bit components from the two source arguments, then add the dot-product result to an accumulator. Each of the two source arguments may have their components treated as either signed or unsigned; the addition to the accumulator may be done with or without saturation (signed or unsigned) depending on instruction.s16*u16VPDPWSUD ymm1,ymm2,ymm3/m256VEX.F3.0F38.W0 D2 /rNo Lunar Lake,
Arrow Lake-S
s16*u16, ssatVPDPWSUDS ymm1,ymm2,ymm3/m256VEX.F3.0F38.W0 D3 /rNo
u16*s16VPDPWUSD ymm1,ymm2,ymm3/m256VEX.66.0F38.W0 D2 /rNo
u16*s16, ssatVPDPWUSDS ymm1,ymm2,ymm3/m256VEX.66.0F38.W0 D3 /rNo
u16*u16VPDPWUUD ymm1,ymm2,ymm3/m256VEX.NP.0F38.W0 D2 /rNo
u16*u16, usatVPDPWUUDS ymm1,ymm2,ymm3/m256VEX.NP.0F38.W0 D3 /rNo
  1. For the VCVTPS2PH instruction, if bit 2 of the imm8 argument is set, then the rounding mode to use is taken from the MXCSR, else the rounding mode is taken from bits 1:0 of the imm8 (the top 5 bits of the imm8 are ignored). The supported rounding modes are:
    Value  Rounding mode
    0      Round to nearest even
    1      Round down
    2      Round up
    3      Round to zero
    A C-intrinsics sketch of these conversions is given after these notes.
  2. VCVTNEPS2BF16 is the only AVX512_BF16 instruction for which the AVX-NE-CONVERT extension provides a VEX-encoded form. The other AVX512_BF16 instructions (none of which have any VEX-encoded forms) are not listed here.
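The F16C conversions above are commonly used from C through the corresponding intrinsics, with the _MM_FROUND_* constants supplying the imm8 rounding control described in note 1. The following is a minimal sketch, assuming a compiler providing <immintrin.h> and F16C code generation (e.g. -mf16c); the buffer names are illustrative only:

    #include <immintrin.h>

    /* Narrow eight FP32 values to FP16 with VCVTPS2PH (imm8 = round to
       nearest even, bit 2 clear), then widen them back with VCVTPH2PS. */
    void fp16_roundtrip(const float in[8], float out[8])
    {
        __m256  v    = _mm256_loadu_ps(in);
        __m128i half = _mm256_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT);  /* VCVTPS2PH */
        __m256  back = _mm256_cvtph_ps(half);                          /* VCVTPH2PS */
        _mm256_storeu_ps(out, back);
    }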


FMA3 and FMA4 instructions

Floating-point fused multiply-add instructions were introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write the result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four operands – a destination operand and three source operands.

FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and was dropped from AMD Zen onwards. The FMA3/FMA4 extensions are not considered an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes, in the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects the operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (Values of x and y outside the given ranges result in something that is not an FMA3 instruction.)
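As an illustration of this nibble split, the following minimal C sketch decodes an FMA3 opcode byte into its operand ordering and operation; the operation names for the bottom nibble follow the standard FMA3 opcode assignments (opcode byte A8, for example, corresponds to VFMADD213PS when combined with the 66 0F38 W0 prefix bytes):

    #include <stdio.h>

    static const char *fma3_ordering(unsigned op)
    {
        switch (op >> 4) {          /* top nibble: operand ordering */
        case 0x9: return "132";
        case 0xA: return "213";
        case 0xB: return "231";
        default:  return NULL;      /* not an FMA3 opcode byte */
        }
    }

    static const char *fma3_operation(unsigned op)
    {
        static const char *ops[10] = {   /* bottom nibble 6..F */
            "FMADDSUB (packed)", "FMSUBADD (packed)",
            "FMADD (packed)",    "FMADD (scalar)",
            "FMSUB (packed)",    "FMSUB (scalar)",
            "FNMADD (packed)",   "FNMADD (scalar)",
            "FNMSUB (packed)",   "FNMSUB (scalar)",
        };
        unsigned y = op & 0xF;
        return (y >= 6) ? ops[y - 6] : NULL;
    }

    int main(void)
    {
        unsigned op = 0xA8;   /* VEX.66.0F38.W0 A8 /r = VFMADD213PS */
        printf("%02X: %s, ordering %s\n", op, fma3_operation(op), fma3_ordering(op));
        return 0;
    }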
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction: the three digits name the order in which the operands enter the computation, with the first two named operands being multiplied and the third one added or subtracted.
  • '132': multiply operands 1 and 3, then add/subtract operand 2
  • '213': multiply operands 2 and 1, then add/subtract operand 3
  • '231': multiply operands 2 and 3, then add/subtract operand 1
In all cases, the result is written back to the first operand.

For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions – these all take the form EVEX.66.MAP6.W0 xy /r with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024, [20] similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions – these all take the form EVEX.NP.MAP6.W0 xy /r with the opcode byte again working similar to the FP32/FP64 variants. (For the FMA4 instructions, no FP16 or BF16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, in the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select the floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.

For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example, VFMADDPS ymm0,ymm1,[mem],ymm2 (memory operand in the third position) is encoded with VEX.W=0, while VFMADDPS ymm0,ymm1,ymm2,[mem] (memory operand in the fourth position) is encoded with VEX.W=1; when all four operands are registers, either encoding may be used.

Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table – with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:

  1. Vector register lanes are counted from 0 upwards in a little-endian manner – the lane that contains the first byte of the vector is considered to be even-numbered.

AVX-512

AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory. [21] Most of the added instructions may also be used with the 256- and 128-bit registers.
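The opmask registers are exposed in C as the __mmask8/__mmask16/... types of the AVX-512 intrinsics. The following is a minimal sketch of a masked operation, assuming a compiler providing <immintrin.h> and AVX-512F code generation (e.g. -mavx512f); the mask value is illustrative only:

    #include <immintrin.h>

    /* Add two 16-lane FP32 vectors, but only in the lanes selected by the
       mask; the remaining lanes keep the contents of 'src'. */
    __m512 masked_add(__m512 src, __m512 a, __m512 b)
    {
        __mmask16 k = 0x00FF;                     /* low 8 lanes; held in k1..k7 */
        return _mm512_mask_add_ps(src, k, a, b);  /* VADDPS zmm1 {k1}, zmm2, zmm3 */
    }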

AMX

Intel AMX adds eight new tile-registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.
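The TILECFG register is loaded from a 64-byte memory structure by the LDTILECFG instruction (the _tile_loadconfig intrinsic). The following C sketch shows that layout as documented by Intel and configures a single full-size tile; it assumes a compiler providing <immintrin.h> with AMX support (e.g. -mamx-tile), and omits the operating-system permission request that is required before AMX state can be used (on Linux, an arch_prctl call):

    #include <immintrin.h>
    #include <string.h>

    struct tile_config {
        unsigned char  palette_id;    /* byte 0: palette 1 = basic AMX palette     */
        unsigned char  start_row;     /* byte 1: restart row for interrupted loads */
        unsigned char  reserved[14];  /* bytes 2..15: must be zero                 */
        unsigned short colsb[16];     /* bytes 16..47: bytes per row, per tile     */
        unsigned char  rows[16];      /* bytes 48..63: number of rows, per tile    */
    };

    static void configure_one_tile(void)
    {
        struct tile_config cfg;
        memset(&cfg, 0, sizeof cfg);
        cfg.palette_id = 1;
        cfg.rows[0]  = 16;            /* tmm0: 16 rows ...       */
        cfg.colsb[0] = 64;            /* ... of 64 bytes each    */
        _tile_loadconfig(&cfg);       /* LDTILECFG               */
    }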

  1. For TILEZERO, the tile-register to clear is specified by bits 5:3 of the instruction's ModR/M byte. Bits 7:6 must be set to 11b, and bits 2:0 must be set to 000b.
  2. 1 2 3 For the TILELOADD, TILELOADDT1 and TILESTORED instructions, the memory argument must use a memory addressing mode with the SIB-byte. Under this addressing mode, the base register and displacement are used to specify the starting address for the first row of the tile to load/store from/to memory – the scale and index are used to specify a per-row stride.
    These instructions are all interruptible – an interrupt or memory exception taken in the middle of these instructions will cause progress tracking information to be written to TILECFG.start_row, so that the instruction may continue on a partially-loaded/stored tile after the interruption. (A C-intrinsics sketch of such a strided tile load/store is given after these notes.)
  3. 1 2 3 4 5 6 7 8 For all of the AMX matrix multiply instructions, the three arguments are required to be three different tile registers, or else the instruction will #UD.
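The following minimal C sketch shows the strided tile load/store from note 2 together with one of the matrix-multiply instructions (TDPBSSD), using the AMX intrinsics (e.g. with -mamx-tile -mamx-int8). It assumes the 16x64 tile configuration from the earlier TILECFG sketch has already been loaded, that the OS permission request has been made, and that the second source matrix is already laid out in the interleaved format the instruction expects; buffer shapes and tile numbers are illustrative only:

    #include <immintrin.h>
    #include <stdint.h>

    void amx_multiply(const int8_t a[16][64], const int8_t b[16][64],
                      int32_t c[16][16])
    {
        _tile_zero(0);             /* TILEZERO tmm0 (accumulator)                 */
        _tile_loadd(1, a, 64);     /* TILELOADD tmm1, [a + row*64] (64-byte stride) */
        _tile_loadd(2, b, 64);     /* TILELOADD tmm2, [b + row*64]                */
        _tile_dpbssd(0, 1, 2);     /* TDPBSSD: signed-int8 dot-products into int32 tmm0 */
        _tile_stored(0, c, 64);    /* TILESTORED [c + row*64], tmm0               */
        _tile_release();           /* return the tile state to its initial state  */
    }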

See also

References

  1. Chips and Cheese, The Weird and Wacky World of VIA Part 2: Zhaoxin’s not quite Electric Boogaloo, 22 Sep 2021, see "Partial AVX2 Support" section. Archived on 14 Oct 2024.
  2. Intel, Intel Advanced Vector Extensions 10 Architecture Specification, rev 1.0, order no. 355989-001US, July 2023, section 1.2 on page 14. Archived on 24 Jul 2023.
  3. Intel, The Converged Vector ISA: Intel Advanced Vector Extensions 10 Technical Paper, order no. 356368-003US, March 2025, pages 5 and 8. Archived on 17 Apr 2025.
  4. Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-L, order no. 253666-087US, March 2025, see entry on CPUID instruction on pages 346 and 348. Archived on 4 May 2025.
  5. Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual order no. 325462-086, December 2024, Volume 2B, MOVD/MOVQ instruction entry, page 1289. Archived on 30 Dec 2024.
  6. AMD, AMD64 Architecture Programmer’s Manual Volumes 1–5, pub.no. 40332, rev 4.08, April 2024, see entries for MOVD instruction on pages 2159 and 3040. Archived on 19 Jan 2025.
  7. Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual order no. 325462-086, December 2024, Volume 3B section 10.1.1, page 3368. Archived on 30 Dec 2024.
  8. AMD, AMD64 Architecture Programmer’s Manual Volumes 1–5, pub.no. 40332, rev 4.08, April 2024, Volume 2, section 7.3.2 on page 650. Archived on 19 Jan 2025.
  9. GCC Bugzilla, Bug 104688 - gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX, see comments 34 and 38 for a statement from Zhaoxin on VMOVDQA atomicity. Archived on 12 Dec 2024.
  10. Stack Overflow, SSE instructions: which CPUs can do atomic 16B memory operations? Archived on 30 Sep 2024.
  11. AMD, The Challenges of Guest Migration, 27 Jun 2009, page 15. Archived from the original on 27 Oct 2014.
  12. 1 2 Intel, Reference Implementations for Intel Architecture Approximation Instructions VRCP14, VRSQRT14, VRCP28, VRSQRT28, and VEXP2, id #671685, Dec 28, 2015. Archived on Sep 18, 2023.

    C code "recip14.c" archived on 18 Sep 2023.

  13. Stack Overflow, Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?, 3 May 2017. Archived on 10 May 1015.
  14. Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual order no. 325462-086, December 2024, Volume 2B, VPEXTRD entry on page 1511 and VPINSRD entry on page 1530. Archived on 30 Dec 2024.
  15. AMD, AMD64 Architecture Programmer’s Manual Volumes 1–5, pub.no. 40332, rev 4.08, April 2024, see entry for VPEXTRD on page 2302 and VPINSRD on page 2329. Archived on 19 Jan 2025.
  16. AMD, Revision Guide for AMD Family 15h Models 00h-0Fh Processors, pub.no. 38603 rev 3.24, Sep 2014, see erratum 592 on page 37. Archived on 22 Jan 2025.
  17. Intel, 2nd Generation Intel Core Processor Family Desktop Specification Update, order no. 324643-037, Apr 2016, see erratum BJ72 on page 43. Archived from the original on 6 Jul 2017.
  18. Intel, Avoiding AVX-SSE Transition Penalties, see section 3.3. Archived on Sep 20, 2024.
  19. Intel, 2nd Generation Intel Core Processor Family Desktop Specification Update, order no. 324643-037, Apr 2016, see erratum BJ49 on page 36. Archived from the original on 6 Jul 2017.
  20. Intel, Advanced Vector Extensions 10.2 Architecture Specification, order no. 361050-001, rev 1.0, July 2024. Archived on 1 Aug 2024.
  21. "Intel AVX-512 Instructions". Intel. Retrieved 21 June 2022.