XOP instruction set

Last updated

The XOP (eXtended Operations [1] ) instruction set, announced by AMD on May 1, 2009, is an extension to the 128-bit SSE core instructions in the x86 and AMD64 instruction set for the Bulldozer processor core, which was released on October 12, 2011. [2] However AMD removed support for XOP from Zen (microarchitecture) onward. [3]

Contents

The XOP instruction set contains several different types of vector instructions since it was originally intended as a major upgrade to SSE. Most of the instructions are integer instructions, but it also contains floating point permutation and floating point fraction extraction instructions. See the index for a list of instruction types.

History

XOP is a revised subset of what was originally intended as SSE5. It was changed to be similar but not overlapping with AVX, parts that overlapped with AVX were removed or moved to separate standards such as FMA4 (floating-point vector multiply–accumulate) and CVT16 (Half-precision floating-point conversion implemented as F16C by Intel). [1]

All SSE5 instructions that were equivalent or similar to instructions in the AVX and FMA4 instruction sets announced by Intel have been changed to use the coding proposed by Intel. Integer instructions without equivalents in AVX were classified as the XOP extension. [1] The XOP instructions have an opcode byte 8F (hexadecimal), but otherwise almost identical coding scheme as AVX with the 3-byte VEX prefix.

Commentators [4] have seen this as evidence that Intel has not allowed AMD to use any part of the large VEX coding space. AMD has been forced to use different codes in order to avoid using any code combination that Intel might possibly be using in its development pipeline for something else. The XOP coding scheme is as close to the VEX scheme as technically possible without risking that the AMD codes overlap with future Intel codes. This inference is speculative, since no public information is available about negotiations between the two companies on this issue.

The use of the 8F byte requires that the m-bits (see VEX coding scheme) have a value larger than or equal to 8 in order to avoid overlap with existing instructions. [Note 1] The C4 byte used in the VEX scheme has no such restriction. This may prevent the use of the m-bits for other purposes in the future in the XOP scheme, but not in the VEX scheme. Another possible problem is that the pp bits have the value 00 in the XOP scheme, while they have the value 01 in the VEX scheme for instructions that have no legacy equivalent. This may complicate the use of the pp bits for other purposes in the future.

A similar compatibility issue is the difference between the FMA3 and FMA4 instruction sets. Intel initially proposed FMA4 in AVX/FMA specification version 3 to supersede the 3-operand FMA proposed by AMD in SSE5. After AMD adopted FMA4, Intel canceled FMA4 support and reverted to FMA3 in the AVX/FMA specification version 5 (See FMA history). [1] [5] [6]

In March 2015, AMD explicitly revealed in the description of the patch for the GNU Binutils package that Zen, its third-generation x86-64 architecture in its first iteration (znver1 – Zen, version 1), will not support TBM, FMA4, XOP and LWP instructions developed specifically for the "Bulldozer" family of micro-architectures. [7] [8]

Integer vector multiply–accumulate instructions

These are integer version of the FMA instruction set. These are all four operand instructions similar to FMA4 and they all operate on signed integers.

InstructionDescription [9] Operation
VPMACSWW

VPMACSSWW

Multiply Accumulate (with Saturation) Word to Word2x8 words (a0-a7, b0-b7) + 8 words (c0-c7) → 8 words (r0-r7)

r0 = a0 * b0 + c0, r1 = a1 * b1 + c1, ..

VPMACSWD

VPMACSSWD

Multiply Accumulate (with Saturation) Low Word to Doubleword2x8 words (a0-a7, b0-b7) + 4 doublewords (c0-c3) → 4 doublewords (r0-r3)

r0 = a0 * b0 + c0, r1 = a2 * b2 + c1, . [2]

VPMACSDD

VPMACSSDD

Multiply Accumulate (with Saturation) Doubleword to Doubleword2x4 doublewords (a0-a3, b0-b3) + 4 doublewords (c0-c3) → 4 doublewords (r0-r3)

r0 = a0 * b0 + c0, r1 = a1 * b1 + c1, ..

VPMACSDQL

VPMACSSDQL

Multiply Accumulate (with Saturation) Low Doubleword to Quadword2x4 doublewords (a0-a3, b0-b3) + 2 quadwords (c0-c1) → 2 quadwords (r0-r3)

r0 = a0 * b0 + c0, r1 = a2 * b2 + c1

VPMACSDQH

VPMACSSDQH

Multiply Accumulate (with Saturation) High Doubleword to Quadword2x4 doublewords (a0-a3, b0-b3) + 2 quadwords (c0-c1) → 2 quadwords (r0-r3)

r0 = a1 * b1 + c0, r1 = a3 * b3 + c1

VPMADCSWD

VPMADCSSWD

Multiply Add Accumulate (with Saturation) Word to Doubleword2x8 words (a0-a7, b0-b7) + 4 doublewords (c0-c3) → 4 doublewords (r0-r3)

r0 = a0 * b0 + a1 * b1 + c0, r1 = a2 * b2 + a3 * b3 + c1, ..

Integer vector horizontal addition

Horizontal addition instructions adds adjacent values in the input vector to each other. The output size in the instructions below describes how wide the horizontal addition performed is. For instance horizontal byte to word adds two bytes at a time and returns the result as vector of words, but byte to quadword adds eight bytes together at a time and returns the result as vector of quadwords. Six additional horizontal addition and subtraction instructions can be found in SSSE3, but they operate on two input vectors and only does two and two operations.

InstructionDescription [9] Operation
VPHADDBW

VPHADDUBW

Horizontal add two signed/unsigned bytes to word16 bytes (a0-a15) → 8 words (r0-r7)

r0 = a0+a1, r1 = a2+a3, r2 = a4+a5, ...

VPHADDBD

VPHADDUBD

Horizontal add four signed/unsigned bytes to doubleword16 bytes (a0-a15) → 4 doublewords (r0-r3)

r0 = a0+a1+a2+a3, r1 = a4+a5+a6+a7, ...

VPHADDBQ

VPHADDUBQ

Horizontal add eight signed/unsigned bytes to quadword16 bytes (a0-a15) → 2 quadwords (r0-r1)

r0 = a0+a1+a2+a3+a4+a5+a6+a7, ...

VPHADDWD

VPHADDUWD

Horizontal add two signed/unsigned words to doubleword8 words (a0-a7) → 4 doublewords (r0-r3)

r0 = a0+a1, r1 = a2+a3, r2 = a4+a5, ...

VPHADDWQ

VPHADDUWQ

Horizontal add four signed/unsigned words to quadword8 words (a0-a7) → 2 quadwords (r0-r1)

r0 = a0+a1+a2+a3, r1 = a4+a5+a6+a7

VPHADDDQ

VPHADDUDQ

Horizontal add two signed/unsigned doublewords to quadword4 doublewords (a0-a3) → 2 quadwords (r0-r1)

r0 = a0+a1, r1 = a2+a3

VPHSUBBWHorizontal subtract two signed bytes to word16 bytes (a0-a15) → 8 words (r0-r7)

r0 = a0-a1, r1 = a2-a3, r2 = a4-a5, ...

VPHSUBWDHorizontal subtract two signed words to doubleword8 words (a0-a7) → 4 doublewords (r0-r3)

r0 = a0-a1, r1 = a2-a3, r2 = a4-a5, ...

VPHSUBDQHorizontal subtract two signed doublewords to quadword4 doublewords (a0-a3) → 2 quadwords (r0-r1)

r0 = a0-a1, r1 = a2-a3

Integer vector compare

This set of vector compare instructions all take an immediate as an extra argument. The immediate controls what kind of comparison is performed. There are eight comparison possible for each instruction. The vectors are compared and all comparisons that evaluate to true set all corresponding bits in the destination to 1, and false comparisons sets all the same bits to 0. This result can be used directly in VPCMOV instruction for a vectorized conditional move.

InstructionDescription [9] ImmediateComparison
VPCOMBCompare Vector Signed Bytes000Less Than
VPCOMWCompare Vector Signed Words001Less Than or Equal
VPCOMDCompare Vector Signed Doublewords010Greater Than
VPCOMQCompare Vector Signed Quadwords011Greater Than or Equal
VPCOMUBCompare Vector Unsigned Bytes100Equal
VPCOMUWCompare Vector Unsigned Words101Not Equal
VPCOMUDCompare Vector Unsigned Doublewords110False
VPCOMUQCompare Vector Unsigned Quadwords111True

Vector conditional move

VPCMOV works as bitwise variant of the blend instructions in SSE4. For each bit in the selector 1 selects the same bit in the first source, and 0 selects the same in the second source. When used together with the XOP vector comparison instructions above this can be used to implement a vectorized ternary move, or if the second input is the same as the destination, a conditional move (CMOV).

InstructionDescription [9]
VPCMOVVector Conditional Move

Integer vector shift and rotate instructions

The shift instructions here differ from those in SSE2 in that they can shift each unit with a different amount using a vector register interpreted as packed signed integers. The sign indicates the direction of shift or rotate, with positive values causing left shift and negative right shift [10] Intel has specified a different incompatible set of variable vector shift instructions in AVX2. [11]

InstructionDescription [9]
VPROTBPacked Rotate Bytes
VPROTWPacked Rotate Words
VPROTDPacked Rotate Doublewords
VPROTQPacked Rotate Quadwords
VPSHABPacked Shift Arithmetic Bytes
VPSHAWPacked Shift Arithmetic Words
VPSHADPacked Shift Arithmetic Doublewords
VPSHAQPacked Shift Arithmetic Quadwords
VPSHLBPacked Shift Logical Bytes
VPSHLWPacked Shift Logical Words
VPSHLDPacked Shift Logical Doublewords
VPSHLQPacked Shift Logical Quadwords

Vector permute

VPPERM is a single instruction that combines the SSSE3 instruction PALIGNR and PSHUFB and adds more to both. Some compare it the Altivec instruction VPERM. [12] It takes three registers as input, the first two are source registers and the third the selector register. Each byte in the selector selects one of the bytes in one of the two input registers for the output. The selector can also apply effects on the selected bytes such as setting it to 0, reverse the bit order, and repeating most signicating bit. All of the effects or the input can in addition be inverted.

The VPERMIL2PD and VPERMIL2PS instructions are two source versions of the VPERMILPD and VPERMILPS instructions in AVX which means like VPPERM they can select output from any of the fields in the two inputs.

InstructionDescription [9]
VPPERMPacked Permute Byte
VPERMIL2PDPermute Two-Source Double-Precision Floating-Point
VPERMIL2PSPermute Two-Source Single-Precision Floating-Point

Floating-point fraction extraction

These instructions extracts the fractional part of floating point, that is the part that would be lost in conversion to integer.

InstructionDescription [9]
VFRCZPDExtract Fraction Packed Double-Precision Floating-Point
VFRCZPSExtract Fraction Packed Single-Precision Floating-Point
VFRCZSDExtract Fraction Scalar Double-Precision Floating-Point
VFRCZSSExtract Fraction Scalar Single-Precision Floating Point

CPUs with XOP

See also

Notes

  1. Byte value 0x8F is an exising opcode for a POP instruction. This instruction uses the ModR/M byte, which follows the opcode, but it does not make use of the "reg" (register) field, which is bits 3-5. Some opcodes which don't use "reg" multiplex instructions by using these bits to signify eight different instructions (0x80-0x83 and 0xD0-0xDF, among others); 0x8F does not. This means, for a standard POP instruction, bits 3-5 should always be zero. Since the m-bits are bits 0-4, requiring a value 8 or higher sets bit 3 of the byte following 0x8F.

Related Research Articles

x86 Family of instruction set architectures

x86 is a family of instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors.

MMX (instruction set) Instruction set designed by Intel

MMX is a single instruction, multiple data (SIMD) instruction set architecture designed by Intel, introduced on January 8, 1997 with its Pentium P5 (microarchitecture) based line of microprocessors, named "Pentium with MMX Technology". It developed out of a similar unit introduced on the Intel i860, and earlier the Intel i750 video pixel processor. MMX is a processor supplementary capability that is supported on IA-32 processors by Intel and other vendors as of 1997.

In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of Central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

3DNow! is a deprecated extension to the x86 instruction set developed by Advanced Micro Devices (AMD). It adds single instruction multiple data (SIMD) instructions to the base x86 instruction set, enabling it to perform vector processing of floating-point vector-operations using Vector registers, which improves the performance of many graphic-intensive applications. The first microprocessor to implement 3DNow was the AMD K6-2, which was introduced in 1998. When the application was appropriate, this raised the speed by about 2–4 times.

SSE2 is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2000. It extends the earlier SSE instruction set, and is intended to fully replace MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.

SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU. In April 2005, AMD introduced a subset of SSE3 in revision E of their Athlon 64 CPUs. The earlier SIMD instruction sets on the x86 platform, from oldest to newest, are MMX, 3DNow!, SSE, and SSE2.

The x86 instruction set refers to the set of instructions that x86-compatible microprocessors support. The instructions are usually part of an executable program, often stored as a computer file and executed on the processor.

The AMD Bulldozer Family 15h is a microprocessor microarchitecture for the FX and Opteron line of processors, developed by AMD for the desktop and server markets. Bulldozer is the codename for this family of microarchitectures. It was released on October 12, 2011, as the successor to the K10 microarchitecture.

The SSE5 was a SIMD instruction set extension proposed by AMD on August 30, 2007 as a supplement to the 128-bit SSE core instructions in the AMD64 architecture.

Advanced Vector Extensions are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.

The VEX prefix and VEX coding scheme are an extension to the x86 and x86-64 instruction set architecture for microprocessors from Intel, AMD and others.

The FMA instruction set is an extension to the 128 and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set to perform fused multiply–add (FMA) operations. There are two variants:

Carry-less Multiplication (CLMUL) is an extension to the x86 instruction set used by microprocessors from Intel and AMD which was proposed by Intel in March 2008 and made available in the Intel Westmere processors announced in early 2010. Mathematically, the instruction implements multiplication of polynomials over the finite field GF(2) where the bitstring represents the polynomial . The CLMUL instruction also allows a more efficient implementation of the closely related multiplication of larger finite fields GF(2k) than the traditional instruction set.

AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and implemented in Intel's Xeon Phi x200 and Skylake-X CPUs; this includes the Core-X series, as well as the new Xeon Scalable Processor Family and Xeon D-2100 Embedded Series.

The F16C instruction set is an x86 instruction set architecture extension which provides support for converting between half-precision and standard IEEE single-precision floating-point formats.

The EVEX prefix and corresponding coding scheme is an extension to the 32-bit x86 (IA-32) and 64-bit x86-64 (AMD64) instruction set architecture. EVEX is based on, but should not be confused with the MVEX prefix used by the Knights Corner processor.

Bit manipulation instructions sets are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD. The purpose of these instruction sets is to improve the speed of bit manipulation. All the instructions in these sets are non-SIMD and operate only on general-purpose registers.

References

  1. 1 2 3 4 Dave Christie (2009-05-07), Striking a balance, AMD Developer blogs, archived from the original on 2013-11-04, retrieved 2013-11-04
  2. 1 2 AMD64 Architecture Programmer's Manual Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions (PDF), AMD, May 1, 2009
  3. Michael Larabel (March 3, 2017). "The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance". Phoronix . But with Zen being a clean-sheet design, there are some instruction set extensions found in Bulldozer processors not found in Zen/znver1. Those no longer present include FMA4 and XOP.
  4. Agner Fog (December 5, 2009), Stop the instruction set war
  5. Intel AVX Programming Reference (PDF), March 2008, retrieved 2012-01-17
  6. Intel Advanced Vector Extensions Programming Reference, January 2009, archived from the original on February 29, 2012, retrieved 2012-01-17
  7. Ganesh Gopalasubramanian (March 10, 2015). "[PATCH] add znver1 processor". binutils@sourceware.org (Mailing list).
  8. Amit Pawar (August 7, 2015). "[PATCH] Remove CpuFMA4 From Znver1 CPU Flags". binutils@sourceware.org (Mailing list).
  9. 1 2 3 4 5 6 7 "AMD64 Architecture Programmer's Manual, Volume4: 128-Bit and 256-Bit Media Instructions" (PDF). AMD . Retrieved 2014-01-13.
  10. "New "Bulldozer" and "Piledriver" Instructions" (PDF). AMD . Retrieved 2014-01-13.
  11. "Intel Architecture Instruction Set Extensions Programming Reference". Intel. Archived from the original (PDF) on February 1, 2014. Retrieved 2014-01-29.
  12. "Buldozer x264 optimisations" . Retrieved 2014-01-13.
  13. Dave Christie (2009-05-07), Striking a balance, AMD Developer blogs, archived from the original on 2013-11-09, retrieved 2012-01-17
  14. New "Bulldozer" and "Piledriver" Instructions (PDF), AMD, October 2012