AVX-512

AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200 (Knights Landing), [1] and then later in a number of AMD and other Intel CPUs (see list below). AVX-512 consists of multiple extensions that may be implemented independently. [2] This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F (AVX-512 Foundation) is required by all AVX-512 implementations.

Besides widening most 256-bit instructions, the extensions introduce various new operations, such as new data conversions, scatter operations, and permutations. [2] The number of AVX registers is increased from 16 to 32, and eight new "mask registers" are added, which allow for variable selection and blending of the results of instructions. In CPUs with the vector length (VL) extension—included in most AVX-512-capable processors (see § CPUs with AVX-512)—these instructions may also be used on the 128-bit and 256-bit vector sizes. AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible. [1]

Instruction set

The AVX-512 instruction set consists of several separate sets each having their own unique CPUID feature bit; however, they are typically grouped by the processor generation that implements them.

F, CD, ER, PF
Introduced with Xeon Phi x200 (Knights Landing) and Xeon Gold/Platinum (Skylake SP "Purley"), with the last two (ER and PF) being specific to Knights Landing.
  • AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control, implemented by Knights Landing and Skylake Xeon
  • AVX-512 Conflict Detection Instructions (CD) – efficient conflict detection to allow more loops to be vectorized, implemented by Knights Landing [1] and Skylake X
  • AVX-512 Exponential and Reciprocal Instructions (ER) – exponential and reciprocal operations designed to help implement transcendental operations, implemented by Knights Landing [1]
  • AVX-512 Prefetch Instructions (PF) – new prefetch capabilities, implemented by Knights Landing [1]
VL, DQ, BW
Introduced with Skylake X and Cannon Lake.
  • AVX-512 Vector Length Extensions (VL) – extends most AVX-512 operations to also operate on XMM (128-bit) and YMM (256-bit) registers [3]
  • AVX-512 Doubleword and Quadword Instructions (DQ) – adds new 32-bit and 64-bit AVX-512 instructions [3]
  • AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit and 16-bit integer operations [3]
IFMA, VBMI
Introduced with Cannon Lake. [4]
  • AVX-512 Integer Fused Multiply Add (IFMA) – fused multiply add of integers using 52-bit precision.
  • AVX-512 Vector Byte Manipulation Instructions (VBMI) – adds vector byte permutation instructions which were not present in AVX-512BW.
4VNNIW, 4FMAPS
Introduced with Knights Mill. [5] [6]
  • AVX-512 Vector Neural Network Instructions Word variable precision (4VNNIW) – vector instructions for deep learning, enhanced word, variable precision.
  • AVX-512 Fused Multiply Accumulation Packed Single precision (4FMAPS) – vector instructions for deep learning, floating point, single precision.
VPOPCNTDQ
Vector population count instruction. Introduced with Knights Mill and Ice Lake. [7]
VNNI, VBMI2, BITALG
Introduced with Ice Lake. [7]
  • AVX-512 Vector Neural Network Instructions (VNNI) – vector instructions for deep learning.
  • AVX-512 Vector Byte Manipulation Instructions 2 (VBMI2) – byte/word load, store and concatenation with shift.
  • AVX-512 Bit Algorithms (BITALG) – byte/word bit manipulation instructions expanding VPOPCNTDQ.
VP2INTERSECT
Introduced with Tiger Lake.
  • AVX-512 Vector Pair Intersection to a Pair of Mask Registers (VP2INTERSECT).
GFNI, VPCLMULQDQ, VAES
Introduced with Ice Lake. [7]
  • These are not AVX-512 features per se. Together with AVX-512, they enable EVEX encoded versions of GFNI, PCLMULQDQ and AES instructions.

Encoding and features

The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. This has led them to define a new prefix called EVEX.

Compared to VEX, EVEX adds the following benefits: [6]

The extended registers, SIMD width bit, and opmask registers of AVX-512 are mandatory and all require support from the OS.

SIMD modes

The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. In addition, the AVX-512VL extension allows AVX-512 instructions to be used on the 128/256-bit XMM/YMM registers, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix, which give access to new features such as opmasks and the additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share a namespace with AVX, making the distinction between VEX- and EVEX-encoded versions of an instruction ambiguous in source code. Since AVX-512F only works on 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are available only with the AVX-512BW extension (byte and word support). [6]

Name | Extension sets | Registers | Types
Legacy SSE | SSE–SSE4.2 | xmm0–xmm15 | single floats. From SSE2: bytes, words, doublewords, quadwords and double floats.
AVX-128 (VEX) | AVX, AVX2 | xmm0–xmm15 | bytes, words, doublewords, quadwords, single floats and double floats.
AVX-256 (VEX) | AVX, AVX2 | ymm0–ymm15 | single float and double float. From AVX2: bytes, words, doublewords, quadwords.
AVX-128 (EVEX) | AVX-512VL | xmm0–xmm31 (k0–k7) | doublewords, quadwords, single float and double float. With AVX512BW: bytes and words. With AVX512-FP16: half float.
AVX-256 (EVEX) | AVX-512VL | ymm0–ymm31 (k0–k7) | doublewords, quadwords, single float and double float. With AVX512BW: bytes and words. With AVX512-FP16: half float.
AVX-512 (EVEX) | AVX-512F | zmm0–zmm31 (k0–k7) | doublewords, quadwords, single float and double float. With AVX512BW: bytes and words. With AVX512-FP16: half float.

Extended registers

x64 AVX-512 register scheme as extension from the x64 AVX (YMM0–YMM15) and x64 SSE (XMM0–XMM15) registers
Bits 511–256 | Bits 255–128 | Bits 127–0
ZMM0 | YMM0 | XMM0
ZMM1 | YMM1 | XMM1
ZMM2 | YMM2 | XMM2
ZMM3 | YMM3 | XMM3
ZMM4 | YMM4 | XMM4
ZMM5 | YMM5 | XMM5
ZMM6 | YMM6 | XMM6
ZMM7 | YMM7 | XMM7
ZMM8 | YMM8 | XMM8
ZMM9 | YMM9 | XMM9
ZMM10 | YMM10 | XMM10
ZMM11 | YMM11 | XMM11
ZMM12 | YMM12 | XMM12
ZMM13 | YMM13 | XMM13
ZMM14 | YMM14 | XMM14
ZMM15 | YMM15 | XMM15
ZMM16 | YMM16 | XMM16
ZMM17 | YMM17 | XMM17
ZMM18 | YMM18 | XMM18
ZMM19 | YMM19 | XMM19
ZMM20 | YMM20 | XMM20
ZMM21 | YMM21 | XMM21
ZMM22 | YMM22 | XMM22
ZMM23 | YMM23 | XMM23
ZMM24 | YMM24 | XMM24
ZMM25 | YMM25 | XMM25
ZMM26 | YMM26 | XMM26
ZMM27 | YMM27 | XMM27
ZMM28 | YMM28 | XMM28
ZMM29 | YMM29 | XMM29
ZMM30 | YMM30 | XMM30
ZMM31 | YMM31 | XMM31

The width of the SIMD register file is increased from 256 bits to 512 bits, and the file is expanded from 16 to a total of 32 registers, ZMM0–ZMM31. These registers can be addressed as 256-bit YMM registers from the AVX extensions and as 128-bit XMM registers from the Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16–XMM31 and YMM16–YMM31 when using the EVEX-encoded form.

Opmask registers

Most AVX-512 instructions may indicate one of 8 opmask registers (k0–k7). For instructions which use a mask register as an opmask, register 'k0' is special: a hardcoded constant used to indicate unmasked operations. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, 'k0' is a functioning, valid register. In most instructions, the opmask is used to control which values are written to the destination. A flag controls the opmask behavior, which can either be "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to the blend instructions.

The opmask registers are normally 16 bits wide, but can be up to 64 bits with the AVX-512BW extension. [6] How many of the bits are actually used, though, depends on the vector type of the instructions masked. For the 32-bit single float or double words, 16 bits are used to mask the 16 elements in a 512-bit register. For double float and quad words, at most 8 mask bits are used.
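
The two masking behaviors and the mask-bit counts described above can be illustrated with a small Python model. This is only a sketch: vectors are modelled as plain lists, and the helper names `masked_op` and `mask_bits_used` are hypothetical, not part of any real API.

```python
import operator

def masked_op(op, dst, a, b, mask, zeroing=False):
    """Apply op lane by lane; bit i of mask decides whether lane i is written."""
    out = []
    for i, (old, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> i) & 1:
            out.append(op(x, y))   # lane selected: write the new result
        elif zeroing:
            out.append(0)          # "zero" behavior: clear the unselected lane
        else:
            out.append(old)        # "merge" behavior: keep the old destination value
    return out

def mask_bits_used(vector_bits, element_bits):
    """Number of opmask bits consulted for a given vector and element width."""
    return vector_bits // element_bits

dst = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
merged = masked_op(operator.add, dst, a, b, 0b0101)
zeroed = masked_op(operator.add, dst, a, b, 0b0101, zeroing=True)
```

With mask `0b0101` only lanes 0 and 2 receive the sum; the other lanes either keep the old destination value or become zero.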

The opmask register is the reason why several bitwise instructions which naturally have no element widths had them added in AVX-512. For instance, bitwise AND, OR or 128-bit shuffle now exist in both double-word and quad-word variants with the only difference being in the final masking.

New opmask instructions

The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded. The initial opmask instructions are all 16-bit (Word) versions. With AVX-512DQ 8-bit (Byte) versions were added to better match the needs of masking 8 64-bit values, and with AVX-512BW 32-bit (Double) and 64-bit (Quad) versions were added so they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.

Instruction | Extension set | Description
KAND | F | Bitwise logical AND Masks
KANDN | F | Bitwise logical AND NOT Masks
KMOV | F | Move from and to Mask Registers or General Purpose Registers
KUNPCK | F | Unpack for Mask Registers
KNOT | F | NOT Mask Register
KOR | F | Bitwise logical OR Masks
KORTEST | F | OR Masks And Set Flags
KSHIFTL | F | Shift Left Mask Registers
KSHIFTR | F | Shift Right Mask Registers
KXNOR | F | Bitwise logical XNOR Masks
KXOR | F | Bitwise logical XOR Masks
KADD | BW/DQ | Add Two Masks
KTEST | BW/DQ | Bitwise comparison and set flags
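
As a rough illustration of how KORTEST connects mask registers to ordinary branch flags, the sketch below models the 16-bit form in Python. The flag rule used here (ZF set when the OR of both masks is zero, CF set when it is all ones) is not stated in the text above and is included as an assumption; `kortest` is a hypothetical helper name.

```python
def kortest(k1, k2, width=16):
    """Model of KORTEST: OR two masks and derive (ZF, CF) from the result."""
    all_ones = (1 << width) - 1
    combined = (k1 | k2) & all_ones
    zf = int(combined == 0)         # assumed: ZF set when no mask bit is set
    cf = int(combined == all_ones)  # assumed: CF set when every mask bit is set
    return zf, cf
```

A conditional jump on ZF after such a test would then answer "did any lane match?", and one on CF "did every lane match?".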

New instructions in AVX-512 foundation

Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are, however, several new instructions, and old instructions that have been replaced with new AVX-512 versions. The new or heavily reworked instructions are listed below. These foundation instructions also include the extensions from AVX-512VL and AVX-512BW since those extensions merely add new versions of these instructions instead of new instructions.

Blend using mask

There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.

Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.

Instruction | Extension set | Description
VBLENDMPD | F | Blend float64 vectors using opmask control
VBLENDMPS | F | Blend float32 vectors using opmask control
VPBLENDMD | F | Blend int32 vectors using opmask control
VPBLENDMQ | F | Blend int64 vectors using opmask control
VPBLENDMB | BW | Blend byte integer vectors using opmask control
VPBLENDMW | BW | Blend word integer vectors using opmask control

Compare into mask

AVX-512F has four new compare instructions. Like their XOP counterparts they use the immediate field to select between 8 different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and initially only support doubleword and quadword comparisons. The AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for the instructions, one to write to and one to declare regular masking. [6]

Immediate | Comparison | Description
0 | EQ | Equal
1 | LT | Less than
2 | LE | Less than or equal
3 | FALSE | Set to zero
4 | NEQ | Not equal
5 | NLT | Greater than or equal
6 | NLE | Greater than
7 | TRUE | Set to one

Instruction | Extension set | Description
VPCMPD, VPCMPUD | F | Compare signed/unsigned doublewords into mask
VPCMPQ, VPCMPUQ | F | Compare signed/unsigned quadwords into mask
VPCMPB, VPCMPUB | BW | Compare signed/unsigned bytes into mask
VPCMPW, VPCMPUW | BW | Compare signed/unsigned words into mask
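
The immediate-to-predicate mapping in the table above can be modelled in a few lines of Python. This is a sketch rather than a description of the hardware: `vpcmp` is a hypothetical helper, elements are plain Python integers, and the optional `writemask` argument models the second mask register by clearing result bits for lanes it does not select (an assumption about the masking detail).

```python
import operator

# Immediate value -> comparison, following the table above.
PREDICATES = {
    0: operator.eq,            # EQ
    1: operator.lt,            # LT
    2: operator.le,            # LE
    3: lambda x, y: False,     # FALSE
    4: operator.ne,            # NEQ
    5: operator.ge,            # NLT
    6: operator.gt,            # NLE
    7: lambda x, y: True,      # TRUE
}

def vpcmp(a, b, imm, writemask=None):
    """Compare lanes of a and b; return the result as an integer bitmask."""
    pred = PREDICATES[imm & 7]
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if pred(x, y):
            result |= 1 << i
    if writemask is not None:
        result &= writemask    # lanes not selected by the write mask yield 0
    return result

a = [1, 5, 3, 7]
b = [4, 5, 2, 9]
less_than = vpcmp(a, b, 1)
```

Here `less_than` has bits 0 and 3 set, because only lanes 0 and 3 satisfy `a < b`.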

Logical set mask

The final way to set masks is using Logical Set Mask. These instructions perform either AND or NAND, and then set the destination opmask based on the result values being zero or non-zero. Note that, like the comparison instructions, these take two opmask registers: one as the destination and one as a regular opmask.

Instruction | Extension set | Description
VPTESTMD, VPTESTMQ | F | Logical AND and set mask for 32 or 64 bit integers.
VPTESTNMD, VPTESTNMQ | F | Logical NAND and set mask for 32 or 64 bit integers.
VPTESTMB, VPTESTMW | BW | Logical AND and set mask for 8 or 16 bit integers.
VPTESTNMB, VPTESTNMW | BW | Logical NAND and set mask for 8 or 16 bit integers.

Compress and expand

The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress only saves the values marked in the mask, but saves them compacted by skipping and not reserving space for unmarked values. Expand operates in the opposite way, by loading as many values as indicated in the mask and then spreading them to the selected positions.

Instruction | Description
VCOMPRESSPD, VCOMPRESSPS | Store sparse packed double/single-precision floating-point values into dense memory
VPCOMPRESSD, VPCOMPRESSQ | Store sparse packed doubleword/quadword integer values into dense memory/register
VEXPANDPD, VEXPANDPS | Load sparse packed double/single-precision floating-point values from dense memory
VPEXPANDD, VPEXPANDQ | Load sparse packed doubleword/quadword integer values from dense memory/register
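
The compress and expand behavior described above can be sketched in Python as follows. The functions `compress` and `expand` are hypothetical models working on lists; in particular, leaving unselected destination lanes unchanged in `expand` corresponds to merge-style masking and is an assumption of this sketch.

```python
def compress(src, mask):
    """Keep only the lanes selected by mask, packed together in order."""
    return [x for i, x in enumerate(src) if (mask >> i) & 1]

def expand(packed, mask, dst):
    """Spread consecutive packed values into the lanes selected by mask."""
    out = list(dst)
    values = iter(packed)
    for i in range(len(out)):
        if (mask >> i) & 1:
            out[i] = next(values)   # next dense value goes to the next selected lane
    return out

packed = compress([10, 20, 30, 40], 0b1010)
restored = expand(packed, 0b1010, [0, 0, 0, 0])
```

Compressing with mask `0b1010` keeps lanes 1 and 3; expanding with the same mask puts them back in their original positions.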

Permute

A new set of permute instructions has been added for full two-input permutations. They all take three arguments: two source registers and one index. The result is written by overwriting either the first source register or the index register. AVX-512BW extends the instructions to also include 16-bit (word) versions, and the AVX-512_VBMI extension defines the byte versions of the instructions.

Instruction | Extension set | Description
VPERMB | VBMI | Permute packed bytes elements.
VPERMW | BW | Permute packed words elements.
VPERMT2B | VBMI | Full byte permute overwriting first source.
VPERMT2W | BW | Full word permute overwriting first source.
VPERMI2PD, VPERMI2PS | F | Full single/double floating-point permute overwriting the index.
VPERMI2D, VPERMI2Q | F | Full doubleword/quadword permute overwriting the index.
VPERMI2B | VBMI | Full byte permute overwriting the index.
VPERMI2W | BW | Full word permute overwriting the index.
VPERMT2PS, VPERMT2PD | F | Full single/double floating-point permute overwriting first source.
VPERMT2D, VPERMT2Q | F | Full doubleword/quadword permute overwriting first source.
VSHUFF32x4, VSHUFF64x2, VSHUFI32x4, VSHUFI64x2 | F | Shuffle four packed 128-bit lines.
VPMULTISHIFTQB | VBMI | Select packed unaligned bytes from quadword sources.
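
A two-input permute can be thought of as a table lookup into the concatenation of both sources. The Python sketch below models that idea under stated assumptions: `permute2` is a hypothetical helper, the lane count is a power of two, the low index bits pick the element and the next bit picks the source, and higher index bits are ignored. It returns a new list instead of overwriting an operand, so it does not distinguish the "overwrite first source" and "overwrite index" forms.

```python
def permute2(a, b, idx):
    """Select each output lane from the 2*n entries of a followed by b."""
    n = len(a)                    # lanes per source, assumed to be a power of two
    table = list(a) + list(b)
    return [table[i & (2 * n - 1)] for i in idx]

a = [0, 1, 2, 3]
b = [10, 11, 12, 13]
picked = permute2(a, b, [7, 0, 4, 3])
```

Indices 0 to 3 read from `a` and indices 4 to 7 read from `b`, so any lane of either source can reach any output position.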

Bitwise ternary logic

Two new instructions can implement every possible bitwise operation on three inputs. They take three registers as input and an 8-bit immediate field. Each bit in the output is generated using a lookup of the three corresponding bits in the inputs to select one of the 8 positions in the 8-bit immediate. Since three bits allow only 8 combinations, this allows all possible three-input bitwise operations to be performed. [6] These are the only bitwise vector instructions in AVX-512F; EVEX versions of the two-source SSE and AVX bitwise vector instructions AND, ANDN, OR and XOR were added in AVX-512DQ.

The difference in the doubleword and quadword versions is only the application of the opmask.

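Instruction | Description
VPTERNLOGD, VPTERNLOGQ | Bitwise Ternary Logic

Bitwise Ternary Logic Truth table
A0 | A1 | A2 | Double AND (0x80) | Double OR (0xFE) | Bitwise blend (0xCA)
0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 1 | 1
0 | 1 | 0 | 0 | 1 | 0
0 | 1 | 1 | 0 | 1 | 1
1 | 0 | 0 | 0 | 1 | 0
1 | 0 | 1 | 0 | 1 | 0
1 | 1 | 0 | 0 | 1 | 1
1 | 1 | 1 | 1 | 1 | 1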
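
The lookup rule behind the truth table can be reproduced with a short Python model. It is a sketch only: `ternlog` is a hypothetical helper operating on Python integers, with A0 taken as the most significant bit of the 3-bit index, which is the ordering the table's three example immediates imply.

```python
def ternlog(a, b, c, imm8, width=32):
    """For every bit position, use the bits of a, b, c as an index into imm8."""
    result = 0
    for bit in range(width):
        idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        result |= ((imm8 >> idx) & 1) << bit
    return result

a, b, c = 0x12345678, 0x0F0F0F0F, 0xDEADBEEF
and3 = ternlog(a, b, c, 0x80)    # 1 only when all three bits are 1
or3 = ternlog(a, b, c, 0xFE)     # 0 only when all three bits are 0
blend = ternlog(a, b, c, 0xCA)   # a selects between b (a=1) and c (a=0)
```

A convenient way to derive an immediate is to evaluate the desired expression on the constants 0xF0, 0xCC and 0xAA: for example `(0xF0 & 0xCC) | (~0xF0 & 0xAA) & 0xFF` gives 0xCA.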

Conversions

A number of conversion or move instructions were added; these complete the set of conversion instructions available from SSE2.

Instruction | Extension set | Description
VPMOVQD, VPMOVSQD, VPMOVUSQD, VPMOVQW, VPMOVSQW, VPMOVUSQW, VPMOVQB, VPMOVSQB, VPMOVUSQB, VPMOVDW, VPMOVSDW, VPMOVUSDW, VPMOVDB, VPMOVSDB, VPMOVUSDB | F | Down convert quadword or doubleword to doubleword, word or byte; unsaturated, saturated or saturated unsigned. The reverse of the sign/zero extend instructions from SSE4.1.
VPMOVWB, VPMOVSWB, VPMOVUSWB | BW | Down convert word to byte; unsaturated, saturated or saturated unsigned.
VCVTPS2UDQ, VCVTPD2UDQ, VCVTTPS2UDQ, VCVTTPD2UDQ | F | Convert with or without truncation, packed single or double-precision floating point to packed unsigned doubleword integers.
VCVTSS2USI, VCVTSD2USI, VCVTTSS2USI, VCVTTSD2USI | F | Convert with or without truncation, scalar single or double-precision floating point to unsigned doubleword integer.
VCVTPS2QQ, VCVTPD2QQ, VCVTPS2UQQ, VCVTPD2UQQ, VCVTTPS2QQ, VCVTTPD2QQ, VCVTTPS2UQQ, VCVTTPD2UQQ | DQ | Convert with or without truncation, packed single or double-precision floating point to packed signed or unsigned quadword integers.
VCVTUDQ2PS, VCVTUDQ2PD | F | Convert packed unsigned doubleword integers to packed single or double-precision floating point.
VCVTUSI2PS, VCVTUSI2PD | F | Convert scalar unsigned doubleword integers to single or double-precision floating point.
VCVTUSI2SD, VCVTUSI2SS | F | Convert scalar unsigned integers to single or double-precision floating point.
VCVTUQQ2PS, VCVTUQQ2PD | DQ | Convert packed unsigned quadword integers to packed single or double-precision floating point.
VCVTQQ2PD, VCVTQQ2PS | DQ | Convert packed quadword integers to packed single or double-precision floating point.
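
The three flavors of down conversion in the table (unsaturated, saturated, saturated unsigned) differ only in how out-of-range values are handled. The Python sketch below illustrates that difference for a single element; the function names are hypothetical, and the unsigned variant assumes its input is already a non-negative (unsigned) value.

```python
def truncate(x, dst_bits):
    """Unsaturated: keep only the low dst_bits bits of the source."""
    return x & ((1 << dst_bits) - 1)

def saturate_signed(x, dst_bits):
    """Signed saturation: clamp to the signed range of the destination width."""
    lo = -(1 << (dst_bits - 1))
    hi = (1 << (dst_bits - 1)) - 1
    return max(lo, min(hi, x))

def saturate_unsigned(x, dst_bits):
    """Unsigned saturation: clamp an unsigned value to the destination maximum."""
    return min(x, (1 << dst_bits) - 1)
```

For example, narrowing the quadword 2^40 to a doubleword gives 0 when truncated, 2^31 − 1 with signed saturation and 2^32 − 1 with unsigned saturation.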

Floating-point decomposition

Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these methods are completely new, they also exist in scalar versions.

Instruction | Description
VGETEXPPD, VGETEXPPS | Convert exponents of packed fp values into fp values
VGETEXPSD, VGETEXPSS | Convert exponent of scalar fp value into fp value
VGETMANTPD, VGETMANTPS | Extract vector of normalized mantissas from float32/float64 vector
VGETMANTSD, VGETMANTSS | Extract float32/float64 of normalized mantissa from float32/float64 scalar
VFIXUPIMMPD, VFIXUPIMMPS | Fix up special packed float32/float64 values
VFIXUPIMMSD, VFIXUPIMMSS | Fix up special scalar float32/float64 value
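
The idea of splitting a value into exponent and normalized mantissa can be sketched with Python's `math.frexp`. This is a model for finite, nonzero inputs only; the helper names `getexp` and `getmant` are hypothetical, and the choices made here (mantissa normalized to the interval [1, 2), sign taken from the source) are assumptions, since the instructions offer several normalization and sign options through their immediate operand.

```python
import math

def getexp(x):
    """Unbiased exponent of x as a float, i.e. floor(log2(|x|))."""
    _, e = math.frexp(abs(x))     # abs(x) == m * 2**e with 0.5 <= m < 1
    return float(e - 1)

def getmant(x):
    """Mantissa of x normalized to [1, 2), carrying the sign of x."""
    m, _ = math.frexp(abs(x))
    return math.copysign(m * 2.0, x)
```

The two parts recombine exactly: `getmant(x) * 2.0 ** getexp(x)` gives back `x`, for example 10.0 = 1.25 × 2^3.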

Floating-point arithmetic

This is the second set of new floating-point methods, which includes new scaling and approximate calculation of reciprocal, and reciprocal of square root. The approximate reciprocal instructions guarantee to have at most a relative error of 2^−14. [6]

Instruction | Description
VRCP14PD, VRCP14PS | Compute approximate reciprocals of packed float32/float64 values
VRCP14SD, VRCP14SS | Compute approximate reciprocals of scalar float32/float64 value
VRNDSCALEPS, VRNDSCALEPD | Round packed float32/float64 values to include a given number of fraction bits
VRNDSCALESS, VRNDSCALESD | Round scalar float32/float64 value to include a given number of fraction bits
VRSQRT14PD, VRSQRT14PS | Compute approximate reciprocals of square roots of packed float32/float64 values
VRSQRT14SD, VRSQRT14SS | Compute approximate reciprocal of square root of scalar float32/float64 value
VSCALEFPS, VSCALEFPD | Scale packed float32/float64 values with float32/float64 values
VSCALEFSS, VSCALEFSD | Scale scalar float32/float64 value with float32/float64 value
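
The scaling and round-to-fraction-bits operations in the table can be modelled for ordinary finite inputs as shown below. The formulas are assumptions of this sketch rather than statements from the text: scaling is taken to be x · 2^floor(y), and rounding uses round-to-nearest-even, which is only one of the rounding modes the instructions can select. The helper names are hypothetical.

```python
import math

def scalef(x, y):
    """Scale x by two raised to the floor of y."""
    return math.ldexp(x, math.floor(y))

def rndscale(x, fraction_bits):
    """Round x so that at most fraction_bits binary fraction bits remain."""
    scale = 2.0 ** fraction_bits
    return round(x * scale) / scale   # Python's round() rounds halves to even
```

For instance, `rndscale(2.71828, 2)` yields 2.75, the nearest multiple of 1/4.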

Broadcast

Instruction | Extension set | Description
VBROADCASTSS, VBROADCASTSD | F, VL | Broadcast single/double floating-point value
VPBROADCASTB, VPBROADCASTW, VPBROADCASTD, VPBROADCASTQ | F, VL, DQ, BW | Broadcast a byte/word/doubleword/quadword integer value
VBROADCASTI32X2, VBROADCASTI64X2, VBROADCASTI32X4, VBROADCASTI32X8, VBROADCASTI64X4 | F, VL, DQ, BW | Broadcast two or four doubleword/quadword integer values

Miscellaneous

Instruction | Extension set | Description
VALIGND, VALIGNQ | F, VL | Align doubleword or quadword vectors
VDBPSADBW | BW | Double block packed sum-absolute-differences (SAD) on unsigned bytes
VPABSQ | F | Packed absolute value quadword
VPMAXSQ, VPMAXUQ | F | Maximum of packed signed/unsigned quadword
VPMINSQ, VPMINUQ | F | Minimum of packed signed/unsigned quadword
VPROLD, VPROLVD, VPROLQ, VPROLVQ, VPRORD, VPRORVD, VPRORQ, VPRORVQ | F | Bit rotate left or right
VPSCATTERDD, VPSCATTERDQ, VPSCATTERQD, VPSCATTERQQ | F | Scatter packed doubleword/quadword with signed doubleword and quadword indices
VSCATTERDPS, VSCATTERDPD, VSCATTERQPS, VSCATTERQPD | F | Scatter packed float32/float64 with signed doubleword and quadword indices

New instructions by sets

Conflict detection

The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized. [8]

Instruction | Name | Description
VPCONFLICTD, VPCONFLICTQ | Detect conflicts within vector of packed double- or quadwords values | Compares each element in the first source, to all elements on same or earlier places in the second source and forms a bit vector of the results
VPLZCNTD, VPLZCNTQ | Count the number of leading zero bits for packed double- or quadword values | Vectorized LZCNT instruction
VPBROADCASTMB2Q, VPBROADCASTMW2D | Broadcast mask to vector register | Either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector
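
To make the conflict-detection idea concrete, the Python sketch below computes, for each element, a bit vector marking the earlier lanes that hold the same value. It is a simplified model with hypothetical helper names: it uses a single input list and compares against strictly earlier positions only. A second helper models the per-element leading-zero count.

```python
def vpconflict(values):
    """For lane i, set bit j (j < i) when values[j] equals values[i]."""
    out = []
    for i, v in enumerate(values):
        bits = 0
        for j in range(i):
            if values[j] == v:
                bits |= 1 << j
        out.append(bits)
    return out

def vplzcnt(values, width=32):
    """Number of leading zero bits of each unsigned element."""
    return [width - v.bit_length() for v in values]

conflicts = vpconflict([5, 7, 5, 5])
```

A lane whose result is zero has no earlier duplicate, so in a scatter-style loop the lanes with zero results form a subset of indices that can be updated together safely.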

Exponential and reciprocal

AVX-512 exponential and reciprocal (AVX-512ER) instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; relative error is at most 2^−28. They also contain two new exponential functions that have a relative error of at most 2^−23. [6]

Instruction | Description
VEXP2PD, VEXP2PS | Compute approximate exponential 2^x of packed single or double-precision floating-point values
VRCP28PD, VRCP28PS | Compute approximate reciprocals of packed single or double-precision floating-point values
VRCP28SD, VRCP28SS | Compute approximate reciprocal of scalar single or double-precision floating-point value
VRSQRT28PD, VRSQRT28PS | Compute approximate reciprocals of square roots of packed single or double-precision floating-point values
VRSQRT28SD, VRSQRT28SS | Compute approximate reciprocal of square root of scalar single or double-precision floating-point value

Prefetch

AVX-512 prefetch (AVX-512PF) instructions contain new prefetch operations for the new scatter and gather functionality introduced in AVX2 and AVX-512. T0 prefetch means prefetching into level 1 cache and T1 means prefetching into level 2 cache.

Instruction | Description
VGATHERPF0DPS, VGATHERPF0QPS, VGATHERPF0DPD, VGATHERPF0QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.
VGATHERPF1DPS, VGATHERPF1QPS, VGATHERPF1DPD, VGATHERPF1QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.
VSCATTERPF0DPS, VSCATTERPF0QPS, VSCATTERPF0DPD, VSCATTERPF0QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.
VSCATTERPF1DPS, VSCATTERPF1QPS, VSCATTERPF1DPD, VSCATTERPF1QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double precision data using writemask k1 and T1 hint with intent to write.

4FMAPS and 4VNNIW

The two sets of instructions perform multiple iterations of processing. They are generally only found in Xeon Phi products.

Instruction | Extension set | Description
V4FMADDPS, V4FMADDSS | 4FMAPS | Packed/scalar single-precision floating-point fused multiply-add (4-iterations)
V4FNMADDPS, V4FNMADDSS | 4FMAPS | Packed/scalar single-precision floating-point fused multiply-add and negate (4-iterations)
VP4DPWSSD | 4VNNIW | Dot product of signed words with double word accumulation (4-iterations)
VP4DPWSSDS | 4VNNIW | Dot product of signed words with double word accumulation and saturation (4-iterations)

BW, DQ and VBMI

AVX-512DQ adds new doubleword and quadword instructions. AVX-512BW adds byte and word versions of the same instructions, and adds byte and word versions of the doubleword/quadword instructions in AVX-512F. A few instructions which get only word forms with AVX-512BW acquire byte forms with the AVX-512_VBMI extension (VPERMB, VPERMI2B, VPERMT2B, VPMULTISHIFTQB).

Two new instructions were added to the mask instructions set: KADD and KTEST (B and W forms with AVX-512DQ, D and Q with AVX-512BW). The rest of mask instructions, which had only word forms, got byte forms with AVX-512DQ and doubleword/quadword forms with AVX-512BW. KUNPCKBW was extended to KUNPCKWD and KUNPCKDQ by AVX-512BW.

Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F. These include all the two-input bitwise instructions and the extract/insert integer instructions.

Instructions that are completely new are covered below.

Floating-point instructions

Three new floating-point operations are introduced. Because these operations are entirely new, rather than wider versions of existing instructions, they have both packed/SIMD and scalar versions.

The VFPCLASS instructions test whether the floating-point value is one of eight special kinds of floating-point value; the immediate field controls which of the eight kinds will set a bit in the output mask register. The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which can also control whether the operation is done on absolute values and, separately, how the sign is handled. The VREDUCE instructions operate on a single source and subtract from it the integer part of the source value plus a number of its fraction bits specified in the immediate field.

Instruction | Extension set | Description
VFPCLASSPS, VFPCLASSPD | DQ | Test types of packed single and double precision floating-point values.
VFPCLASSSS, VFPCLASSSD | DQ | Test types of scalar single and double precision floating-point values.
VRANGEPS, VRANGEPD | DQ | Range restriction calculation for packed floating-point values.
VRANGESS, VRANGESD | DQ | Range restriction calculation for scalar floating-point values.
VREDUCEPS, VREDUCEPD | DQ | Perform reduction transformation on packed floating-point values.
VREDUCESS, VREDUCESD | DQ | Perform reduction transformation on scalar floating-point values.

Other instructions

Instruction | Extension set | Description
VPMOVM2D, VPMOVM2Q | DQ | Convert mask register to double- or quad-word vector register.
VPMOVM2B, VPMOVM2W | BW | Convert mask register to byte or word vector register.
VPMOVD2M, VPMOVQ2M | DQ | Convert double- or quad-word vector register to mask register.
VPMOVB2M, VPMOVW2M | BW | Convert byte or word vector register to mask register.
VPMULLQ | DQ | Multiply packed quadword store low result. A quadword version of VPMULLD.

VBMI2

Extend VPCOMPRESS and VPEXPAND with byte and word variants. Shift instructions are new.

Instruction | Description
VPCOMPRESSB, VPCOMPRESSW | Store sparse packed byte/word integer values into dense memory/register
VPEXPANDB, VPEXPANDW | Load sparse packed byte/word integer values from dense memory/register
VPSHLD | Concatenate and shift packed data left logical
VPSHLDV | Concatenate and variable shift packed data left logical
VPSHRD | Concatenate and shift packed data right logical
VPSHRDV | Concatenate and variable shift packed data right logical
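
The "concatenate and shift" operations can be pictured as a funnel shift: two elements are joined into a double-width value, shifted, and one half is kept. The Python sketch below models this for a single 16-bit element. The helper names are hypothetical, and the operand ordering (which input forms the upper half, and taking the shift count modulo the element width) is an assumption of this sketch.

```python
def shld(a, b, count, width=16):
    """Shift a:b left by count and return the upper half."""
    count %= width
    mask = (1 << width) - 1
    concat = (a << width) | b          # a is the upper half, b the lower half
    return ((concat << count) >> width) & mask

def shrd(a, b, count, width=16):
    """Shift b:a right by count and return the lower half."""
    count %= width
    mask = (1 << width) - 1
    concat = (b << width) | a          # b is the upper half, a the lower half
    return (concat >> count) & mask
```

With `a = 0x1234`, `b = 0xABCD` and a count of 4, the left form yields 0x234A (bits of `b` shifted in from the right) and the right form yields 0xD123 (bits of `b` shifted in from the left).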

VNNI

Vector Neural Network Instructions: [9] AVX512-VNNI adds EVEX-coded instructions described below. With AVX-512F, these instructions can operate on 512-bit vectors, and AVX-512VL further adds support for 128- and 256-bit vectors.

A later AVX-VNNI extension adds VEX encodings of these instructions which can only operate on 128- or 256-bit vectors. AVX-VNNI is not part of the AVX-512 suite; it does not require AVX-512F and can be implemented independently.

Instruction | Description
VPDPBUSD | Multiply and add unsigned and signed bytes
VPDPBUSDS | Multiply and add unsigned and signed bytes with saturation
VPDPWSSD | Multiply and add signed word integers
VPDPWSSDS | Multiply and add word integers with saturation
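
The byte form of the multiply-and-add operation can be sketched for one 32-bit lane as a four-term dot product added to an accumulator. The Python model below rests on assumptions not spelled out in the table: four unsigned bytes from one source are multiplied with four signed bytes from the other, the non-saturating form wraps to 32 bits, and the saturating form clamps to the signed 32-bit range. `vpdpbusd_lane` is a hypothetical name.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def vpdpbusd_lane(acc, unsigned_bytes, signed_bytes, saturate=False):
    """Add the dot product of 4 unsigned and 4 signed bytes to a 32-bit accumulator."""
    total = acc + sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))
    if saturate:
        return max(INT32_MIN, min(INT32_MAX, total))   # clamp instead of wrapping
    total &= 0xFFFFFFFF                                 # wrap to 32 bits
    return total - 2**32 if total >= 2**31 else total

lane = vpdpbusd_lane(10, [1, 2, 3, 4], [1, -1, 2, -2])
```

Here `lane` is 10 + (1 − 2 + 6 − 8) = 7; the difference between the two forms only shows when the sum leaves the 32-bit range.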

IFMA

Integer fused multiply-add instructions. AVX512-IFMA adds EVEX-coded instructions described below.

A separate AVX-IFMA instruction set extension defines VEX encoding of these instructions. This extension is not part of the AVX-512 suite and can be implemented independently.

Instruction | Extension set | Description
VPMADD52LUQ | IFMA | Packed multiply of unsigned 52-bit integers and add the low 52-bit products to 64-bit accumulators
VPMADD52HUQ | IFMA | Packed multiply of unsigned 52-bit integers and add the high 52-bit products to 64-bit accumulators
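
The two IFMA operations in the table can be modelled for a single 64-bit lane as follows. This is a sketch with a hypothetical helper name; it assumes that only the low 52 bits of each multiplicand are used and that the accumulator wraps at 64 bits.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def vpmadd52(acc, a, b, high=False):
    """Multiply the low 52 bits of a and b; add the low or high 52 product bits to acc."""
    product = (a & MASK52) * (b & MASK52)      # up to 104 bits wide
    part = (product >> 52) if high else (product & MASK52)
    return (acc + part) & MASK64
```

Calling the function once with `high=False` and once with `high=True` yields both halves of the 104-bit product, which is the building block for multi-word (big integer) multiplication with 52-bit limbs.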

VPOPCNTDQ and BITALG

Instruction | Extension set | Description
VPOPCNTD, VPOPCNTQ | VPOPCNTDQ | Return the number of bits set to 1 in doubleword/quadword
VPOPCNTB, VPOPCNTW | BITALG | Return the number of bits set to 1 in byte/word
VPSHUFBITQMB | BITALG | Shuffle bits from quadword elements using byte indexes into mask

VP2INTERSECT

Instruction | Extension set | Description
VP2INTERSECTD, VP2INTERSECTQ | VP2INTERSECT | Compute intersection between doublewords/quadwords to a pair of mask registers
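
A plain Python model of the intersection operation is given below as an illustration. The helper name `vp2intersect` is hypothetical, and the sketch assumes that the first result mask marks the lanes of the first vector that occur anywhere in the second, while the second mask marks the lanes of the second vector that occur anywhere in the first.

```python
def vp2intersect(a, b):
    """Return (mask_a, mask_b) marking the lanes of a and b that have a match."""
    mask_a = mask_b = 0
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                mask_a |= 1 << i   # lane i of a appears in b
                mask_b |= 1 << j   # lane j of b appears in a
    return mask_a, mask_b

masks = vp2intersect([1, 2, 3, 4], [4, 9, 2, 2])
```

The values 2 and 4 are shared, so lanes 1 and 3 of the first vector and lanes 0, 2 and 3 of the second are marked.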

GFNI

Galois field new instructions are useful for cryptography, [10] as they can be used to implement Rijndael-style S-boxes such as those used in AES, Camellia, and SM4. [11] These instructions may also be used for bit manipulation in networking and signal processing. [10]

GFNI is a standalone instruction set extension and can be enabled separately from AVX or AVX-512. Depending on whether AVX and AVX-512F support is indicated by the CPU, GFNI support enables legacy (SSE), VEX or EVEX-coded instructions operating on 128, 256 or 512-bit vectors.

Instruction | Description
VGF2P8AFFINEINVQB | Galois field affine transformation inverse
VGF2P8AFFINEQB | Galois field affine transformation
VGF2P8MULB | Galois field multiply bytes
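
Byte multiplication in a Galois field can be sketched with the classic shift-and-reduce loop. The Python model below assumes the field GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), the one used by AES; the text above does not state the polynomial, so treat it as an assumption. `gf2p8mul` is a hypothetical helper name.

```python
def gf2p8mul(a, b):
    """Multiply two bytes as polynomials over GF(2), reduced modulo 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a        # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B         # reduce when the degree reaches 8
    return result
```

The worked example from the AES specification, {57} · {83} = {c1}, can be used as a check.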

VPCLMULQDQ

VPCLMULQDQ with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction. With AVX-512VL, it adds EVEX-encoded 256- and 128-bit versions. VPCLMULQDQ alone (that is, on non-AVX512 CPUs) adds only a VEX-encoded 256-bit version. (Availability of the VEX-encoded 128-bit version is indicated by different CPUID bits: PCLMULQDQ and AVX.) The wider than 128-bit variations of the instruction perform the same operation on each 128-bit portion of the input registers, but they do not extend it to select quadwords from different 128-bit fields (the meaning of the imm8 operand is the same: either the low or the high quadword of the 128-bit field is selected).

Instruction | Description
VPCLMULQDQ | Carry-less multiplication quadword
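
Carry-less multiplication and the quadword selection by imm8 can be illustrated for one 128-bit field with the Python sketch below. The helper names are hypothetical, each 128-bit field is modelled as a `(low, high)` pair of 64-bit integers, and the choice of imm8 bit 0 for the first source and bit 4 for the second is an assumption not stated in the text above.

```python
def clmul64(a, b):
    """Carry-less (XOR-based) product of two 64-bit values, up to 127 bits wide."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i   # XOR instead of add: no carries propagate
    return result

def pclmulqdq_field(x, y, imm8):
    """Pick one quadword from each (low, high) pair and multiply them carry-lessly."""
    a = x[imm8 & 1]            # assumed: imm8 bit 0 selects low/high quadword of x
    b = y[(imm8 >> 4) & 1]     # assumed: imm8 bit 4 selects low/high quadword of y
    return clmul64(a, b)
```

A wider vector would simply repeat `pclmulqdq_field` on each 128-bit field with the same `imm8`, as described above.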

VAES

VEX- and EVEX-encoded AES instructions. The wider than 128-bit variations of the instruction perform the same operation on each 128-bit portion of input registers. The VEX versions can be used without AVX-512 support.

Instruction | Description
VAESDEC | Perform one round of an AES decryption flow
VAESDECLAST | Perform last round of an AES decryption flow
VAESENC | Perform one round of an AES encryption flow
VAESENCLAST | Perform last round of an AES encryption flow

BF16

AI acceleration instructions operating on the Bfloat16 numbers.

Instruction | Description
VCVTNE2PS2BF16 | Convert two vectors of packed single precision numbers into one vector of packed Bfloat16 numbers
VCVTNEPS2BF16 | Convert one vector of packed single precision numbers to one vector of packed Bfloat16 numbers
VDPBF16PS | Calculate dot product of two Bfloat16 pairs and accumulate the result into one packed single precision number
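
The conversion from single precision to Bfloat16 amounts to keeping the upper 16 bits of the float32 encoding after rounding. The Python sketch below models this for one value, assuming round-to-nearest-even and a simple quiet-NaN rule; the hardware's treatment of denormals and other special cases is not modelled, and both helper names are hypothetical.

```python
import struct

def float32_to_bf16_bits(x):
    """Round a float32 value to the 16-bit Bfloat16 encoding (nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return (bits >> 16) | 0x0040                 # NaN: return a quiet NaN
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)      # ties go to the even result
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float(h):
    """Widen a Bfloat16 encoding back to a Python float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]
```

Because Bfloat16 keeps the 8 exponent bits of float32 and only shortens the fraction, widening back is exact and the format covers the same range as single precision.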

FP16

An extension of the earlier F16C instruction set, adding comprehensive support for the binary16 floating-point numbers (also known as FP16, float16 or half-precision floating-point numbers). The new instructions implement most operations that were previously available for single and double-precision floating-point numbers and also introduce new complex number instructions and conversion instructions. Scalar and packed operations are supported.

Unlike the single and double-precision format instructions, the half-precision operands are neither conditionally flushed to zero (FTZ) nor conditionally treated as zero (DAZ) based on MXCSR settings. Subnormal values are processed at full speed by hardware to facilitate using the full dynamic range of the FP16 numbers. Instructions that create FP32 and FP64 numbers still respect the MXCSR.FTZ bit. [12]

Arithmetic instructions

| Instruction | Description |
|---|---|
| VADDPH, VADDSH | Add packed/scalar FP16 numbers. |
| VSUBPH, VSUBSH | Subtract packed/scalar FP16 numbers. |
| VMULPH, VMULSH | Multiply packed/scalar FP16 numbers. |
| VDIVPH, VDIVSH | Divide packed/scalar FP16 numbers. |
| VSQRTPH, VSQRTSH | Compute square root of packed/scalar FP16 numbers. |
| VFMADD{132, 213, 231}PH, VFMADD{132, 213, 231}SH | Multiply-add packed/scalar FP16 numbers. |
| VFNMADD{132, 213, 231}PH, VFNMADD{132, 213, 231}SH | Negated multiply-add packed/scalar FP16 numbers. |
| VFMSUB{132, 213, 231}PH, VFMSUB{132, 213, 231}SH | Multiply-subtract packed/scalar FP16 numbers. |
| VFNMSUB{132, 213, 231}PH, VFNMSUB{132, 213, 231}SH | Negated multiply-subtract packed/scalar FP16 numbers. |
| VFMADDSUB{132, 213, 231}PH | Multiply-add (odd vector elements) or multiply-subtract (even vector elements) packed FP16 numbers. |
| VFMSUBADD{132, 213, 231}PH | Multiply-subtract (odd vector elements) or multiply-add (even vector elements) packed FP16 numbers. |
| VREDUCEPH, VREDUCESH | Perform reduction transformation of the packed/scalar FP16 numbers. |
| VRNDSCALEPH, VRNDSCALESH | Round packed/scalar FP16 numbers to a given number of fraction bits. |
| VSCALEFPH, VSCALEFSH | Scale packed/scalar FP16 numbers by multiplying them by a power of two. |
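
The alternating pattern of VFMADDSUB and VFMSUBADD can be modelled on plain Python lists. This is only a sketch: the helper names are hypothetical, Python floats are double precision rather than FP16, and the multiplication is not fused with the addition as it would be in hardware.

```python
def fmaddsub(a, b, c):
    """Odd-indexed elements compute a*b + c, even-indexed elements a*b - c."""
    return [x * y + z if i % 2 else x * y - z
            for i, (x, y, z) in enumerate(zip(a, b, c))]


def fmsubadd(a, b, c):
    """Odd-indexed elements compute a*b - c, even-indexed elements a*b + c."""
    return [x * y - z if i % 2 else x * y + z
            for i, (x, y, z) in enumerate(zip(a, b, c))]
```

Element 0 is treated as the lowest (even) vector element in this model.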

Complex arithmetic instructions

| Instruction | Description |
|---|---|
| VFMULCPH, VFMULCSH | Multiply packed/scalar complex FP16 numbers. |
| VFCMULCPH, VFCMULCSH | Multiply packed/scalar complex FP16 numbers. Complex conjugate form of the operation. |
| VFMADDCPH, VFMADDCSH | Multiply-add packed/scalar complex FP16 numbers. |
| VFCMADDCPH, VFCMADDCSH | Multiply-add packed/scalar complex FP16 numbers. Complex conjugate form of the operation. |
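
The difference between the plain and conjugate forms can be illustrated with a small Python sketch. It assumes complex numbers are stored as interleaved (real, imaginary) pairs and that the conjugate form conjugates the second operand; both points, like the function name, are assumptions for illustration and should be checked against Intel's specification.

```python
def complex_mul_pairs(a, b, conjugate=False):
    """Multiply interleaved complex vectors [re0, im0, re1, im1, ...].

    With conjugate=True the second operand is conjugated first
    (an assumption about which operand the conjugate form affects).
    """
    out = []
    for i in range(0, len(a), 2):
        ar, ai = a[i], a[i + 1]
        br, bi = b[i], b[i + 1]
        if conjugate:
            bi = -bi
        out += [ar * br - ai * bi, ar * bi + ai * br]
    return out
```

For instance, (1 + 2i)(3 + 4i) = -5 + 10i, while (1 + 2i)(3 - 4i) = 11 + 2i.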

Approximate reciprocal instructions

| Instruction | Description |
|---|---|
| VRCPPH, VRCPSH | Compute approximate reciprocal of the packed/scalar FP16 numbers. The maximum relative error of the approximation is less than 2^−11 + 2^−14. |
| VRSQRTPH, VRSQRTSH | Compute approximate reciprocal square root of the packed/scalar FP16 numbers. The maximum relative error of the approximation is less than 2^−14. |

Comparison instructions

| Instruction | Description |
|---|---|
| VCMPPH, VCMPSH | Compare the packed/scalar FP16 numbers and store the result in a mask register. |
| VCOMISH | Compare the scalar FP16 numbers and store the result in the flags register. Signals an exception if a source operand is QNaN or SNaN. |
| VUCOMISH | Compare the scalar FP16 numbers and store the result in the flags register. Signals an exception only if a source operand is SNaN. |
| VMAXPH, VMAXSH | Select the maximum of each vertical pair of the source packed/scalar FP16 numbers. |
| VMINPH, VMINSH | Select the minimum of each vertical pair of the source packed/scalar FP16 numbers. |
| VFPCLASSPH, VFPCLASSSH | Test packed/scalar FP16 numbers for special categories (NaN, infinity, negative zero, etc.) and store the result in a mask register. |

Conversion instructions

| Instruction | Description |
|---|---|
| VCVTW2PH | Convert packed signed 16-bit integers to FP16 numbers. |
| VCVTUW2PH | Convert packed unsigned 16-bit integers to FP16 numbers. |
| VCVTDQ2PH | Convert packed signed 32-bit integers to FP16 numbers. |
| VCVTUDQ2PH | Convert packed unsigned 32-bit integers to FP16 numbers. |
| VCVTQQ2PH | Convert packed signed 64-bit integers to FP16 numbers. |
| VCVTUQQ2PH | Convert packed unsigned 64-bit integers to FP16 numbers. |
| VCVTPS2PHX | Convert packed FP32 numbers to FP16 numbers. Unlike VCVTPS2PH from F16C, VCVTPS2PHX has a different encoding that also supports broadcasting. |
| VCVTPD2PH | Convert packed FP64 numbers to FP16 numbers. |
| VCVTSI2SH | Convert a scalar signed 32-bit or 64-bit integer to FP16 number. |
| VCVTUSI2SH | Convert a scalar unsigned 32-bit or 64-bit integer to FP16 number. |
| VCVTSS2SH | Convert a scalar FP32 number to FP16 number. |
| VCVTSD2SH | Convert a scalar FP64 number to FP16 number. |
| VCVTPH2W, VCVTTPH2W | Convert packed FP16 numbers to signed 16-bit integers. VCVTPH2W rounds the value according to the MXCSR register. VCVTTPH2W rounds toward zero. |
| VCVTPH2UW, VCVTTPH2UW | Convert packed FP16 numbers to unsigned 16-bit integers. VCVTPH2UW rounds the value according to the MXCSR register. VCVTTPH2UW rounds toward zero. |
| VCVTPH2DQ, VCVTTPH2DQ | Convert packed FP16 numbers to signed 32-bit integers. VCVTPH2DQ rounds the value according to the MXCSR register. VCVTTPH2DQ rounds toward zero. |
| VCVTPH2UDQ, VCVTTPH2UDQ | Convert packed FP16 numbers to unsigned 32-bit integers. VCVTPH2UDQ rounds the value according to the MXCSR register. VCVTTPH2UDQ rounds toward zero. |
| VCVTPH2QQ, VCVTTPH2QQ | Convert packed FP16 numbers to signed 64-bit integers. VCVTPH2QQ rounds the value according to the MXCSR register. VCVTTPH2QQ rounds toward zero. |
| VCVTPH2UQQ, VCVTTPH2UQQ | Convert packed FP16 numbers to unsigned 64-bit integers. VCVTPH2UQQ rounds the value according to the MXCSR register. VCVTTPH2UQQ rounds toward zero. |
| VCVTPH2PSX | Convert packed FP16 numbers to FP32 numbers. Unlike VCVTPH2PS from F16C, VCVTPH2PSX has a different encoding that also supports broadcasting. |
| VCVTPH2PD | Convert packed FP16 numbers to FP64 numbers. |
| VCVTSH2SI, VCVTTSH2SI | Convert a scalar FP16 number to signed 32-bit or 64-bit integer. VCVTSH2SI rounds the value according to the MXCSR register. VCVTTSH2SI rounds toward zero. |
| VCVTSH2USI, VCVTTSH2USI | Convert a scalar FP16 number to unsigned 32-bit or 64-bit integer. VCVTSH2USI rounds the value according to the MXCSR register. VCVTTSH2USI rounds toward zero. |
| VCVTSH2SS | Convert a scalar FP16 number to FP32 number. |
| VCVTSH2SD | Convert a scalar FP16 number to FP64 number. |
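
The difference between the MXCSR-rounded and truncating conversions can be sketched in Python, whose `struct` module supports the binary16 format through the `"e"` code. The sketch assumes the default MXCSR rounding mode (round to nearest, ties to even); the function names are hypothetical, and out-of-range results, NaN and infinity are not modelled.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest binary16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def cvt_nearest_even(x: float) -> int:
    """Model of a VCVTPH2*-style conversion under the default rounding mode."""
    return round(to_fp16(x))  # Python's round() breaks ties toward even


def cvt_truncate(x: float) -> int:
    """Model of a VCVTTPH2*-style conversion: always round toward zero."""
    return int(to_fp16(x))
```

With these helpers, 3.5 converts to 4 under round-to-nearest-even but to 3 under truncation, and -2.75 converts to -3 and -2 respectively.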

Decomposition instructions

| Instruction | Description |
|---|---|
| VGETEXPPH, VGETEXPSH | Extract exponent components of packed/scalar FP16 numbers as FP16 numbers. |
| VGETMANTPH, VGETMANTSH | Extract mantissa components of packed/scalar FP16 numbers as FP16 numbers. |
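
The decomposition can be illustrated with Python's `math.frexp`. This is a simplified sketch with hypothetical function names: it works on double-precision floats, assumes finite non-zero inputs, and models only one of the normalization options the mantissa instruction offers (significand in [1, 2) with the sign of the source).

```python
import math


def getexp(x: float) -> float:
    """Return floor(log2(|x|)) as a float; x must be finite and non-zero."""
    _, e = math.frexp(abs(x))  # |x| = m * 2**e with 0.5 <= m < 1
    return float(e - 1)


def getmant(x: float) -> float:
    """Return the significand scaled into [1, 2), keeping the sign of x."""
    m, _ = math.frexp(x)       # m carries the sign of x
    return m * 2.0
```

The two results recombine to the original value: `math.ldexp(getmant(x), int(getexp(x)))` equals `x` for inputs in the supported range.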

Move instructions

| Instruction | Description |
|---|---|
| VMOVSH | Move scalar FP16 number to/from memory or between vector registers. |
| VMOVW | Move scalar FP16 number to/from memory or general purpose register. |

Legacy instructions with EVEX-encoded versions

Columns: Group | Legacy encoding (SSE/SSE2/MMX; AVX/SSE3/SSE4; AVX2/FMA) | Instructions | AVX-512 extensions (F, VL, BW, DQ)
VADDYesYesNoVADDPD, VADDPS, VADDSD, VADDSSYYNN
VANDVANDPD, VANDPS, VANDNPD, VANDNPSNY
VCMPVCMPPD, VCMPPS, VCMPSD, VCMPSSYNN
VCOMVCOMISD, VCOMISS
VDIVVDIVPD, VDIVPS, VDIVSD, VDIVSSY
VCVTVCVTDQ2PD, VCVTDQ2PS, VCVTPD2DQ, VCVTPD2PS, VCVTPH2PS, VCVTPS2PH, VCVTPS2DQ, VCVTPS2PD, VCVTSD2SI, VCVTSD2SS, VCVTSI2SD, VCVTSI2SS, VCVTSS2SD, VCVTSS2SI, VCVTTPD2DQ, VCVTTPS2DQ, VCVTTSD2SI, VCVTTSS2SI
VMAXVMAXPD, VMAXPS, VMAXSD, VMAXSS
VMINVMINPD, VMINPS, VMINSD, VMINSSN
VMOVVMOVAPD, VMOVAPS, VMOVD, VMOVQ, VMOVDDUP, VMOVHLPS, VMOVHPD, VMOVHPS, VMOVLHPS, VMOVLPD, VMOVLPS, VMOVNTDQA, VMOVNTDQ, VMOVNTPD, VMOVNTPS, VMOVSD, VMOVSHDUP, VMOVSLDUP, VMOVSS, VMOVUPD, VMOVUPS, VMOVDQA32, VMOVDQA64, VMOVDQU8, VMOVDQU16, VMOVDQU32, VMOVDQU64YY
VMULVMULPD, VMULPS, VMULSD, VMULSSN
VORVORPD, VORPSNY
VSQRTVSQRTPD, VSQRTPS, VSQRTSD, VSQRTSSYN
VSUBVSUBPD, VSUBPS, VSUBSD, VSUBSS
VUCOMIVUCOMISD, VUCOMISSN
VUNPCKVUNPCKHPD, VUNPCKHPS, VUNPCKLPD, VUNPCKLPSY
VXORVXORPD, VXORPSNY
VEXTRACTPSNoYesNoVEXTRACTPSYNN
VINSERTPSVINSERTPS
VPEXTRVPEXTRB, VPEXTRW, VPEXTRD, VPEXTRQNYY
VPINSRVPINSRB, VPINSRW, VPINSRD, VPINSRQ
VPACKYesYesYesVPACKSSWB, VPACKSSDW, VPACKUSDW, VPACKUSWBYN
VPADDVPADDB, VPADDW, VPADDD, VPADDQ, VPADDSB, VPADDSW, VPADDUSB, VPADDUSWY
VPANDVPANDD, VPANDQ, VPANDND, VPANDNQN
VPAVGVPAVGB, VPAVGWNY
VPCMPVPCMPEQB, VPCMPEQW, VPCMPEQD, VPCMPEQQ, VPCMPGTB, VPCMPGTW, VPCMPGTD, VPCMPGTQY
VPMAXVPMAXSB, VPMAXSW, VPMAXSD, VPMAXSQ, VPMAXUB, VPMAXUW, VPMAXUD, VPMAXUQ
VPMINVPMINSB, VPMINSW, VPMINSD, VPMINSQ, VPMINUB, VPMINUW, VPMINUD, VPMINUQ
VPMOVVPMOVSXBW, VPMOVSXBD, VPMOVSXBQ, VPMOVSXWD, VPMOVSXWQ, VPMOVSXDQ, VPMOVZXBW, VPMOVZXBD, VPMOVZXBQ, VPMOVZXWD, VPMOVZXWQ, VPMOVZXDQ
VPMULVPMULDQ, VPMULUDQ, VPMULHRSW, VPMULHUW, VPMULHW, VPMULLD, VPMULLQ, VPMULLW
VPORVPORD, VPORQN
VPSUBVPSUBB, VPSUBW, VPSUBD, VPSUBQ, VPSUBSB, VPSUBSW, VPSUBUSB, VPSUBUSWY
VPUNPCKVPUNPCKHBW, VPUNPCKHWD, VPUNPCKHDQ, VPUNPCKHQDQ, VPUNPCKLBW, VPUNPCKLWD, VPUNPCKLDQ, VPUNPCKLQDQ
VPXORVPXORD, VPXORQN
VPSADBWVPSADBWNY
VPSHUFVPSHUFB, VPSHUFHW, VPSHUFLW, VPSHUFD, VPSLLDQ, VPSLLW, VPSLLD, VPSLLQ, VPSRAW, VPSRAD, VPSRAQ, VPSRLDQ, VPSRLW, VPSRLD, VPSRLQ, VPSLLVW, VPSLLVD, VPSLLVQ, VPSRLVW, VPSRLVD, VPSRLVQ, VPSHUFPD, VPSHUFPSY
VEXTRACTNoYesYesVEXTRACTF32X4, VEXTRACTF64X2, VEXTRACTF32X8, VEXTRACTF64X4, VEXTRACTI32X4, VEXTRACTI64X2, VEXTRACTI32X8, VEXTRACTI64X4NY
VINSERTVINSERTF32x4, VINSERTF64X2, VINSERTF32X8, VINSERTF64x4, VINSERTI32X4, VINSERTI64X2, VINSERTI32X8, VINSERTI64X4
VPABSVPABSB, VPABSW, VPABSD, VPABSQYN
VPALIGNRVPALIGNRN
VPERMVPERMD, VPERMILPD, VPERMILPS, VPERMPD, VPERMPS, VPERMQYN
VPMADDVPMADDUBSWVPMADDWDNY
VFMADDNoNoYesVFMADD132PD, VFMADD213PD, VFMADD231PD, VFMADD132PS, VFMADD213PS, VFMADD231PS, VFMADD132SD, VFMADD213SD, VFMADD231SD, VFMADD132SS, VFMADD213SS, VFMADD231SSYN
VFMADDSUBVFMADDSUB132PD, VFMADDSUB213PD, VFMADDSUB231PD, VFMADDSUB132PS, VFMADDSUB213PS, VFMADDSUB231PS
VFMSUBADDVFMSUBADD132PD, VFMSUBADD213PD, VFMSUBADD231PD, VFMSUBADD132PS, VFMSUBADD213PS, VFMSUBADD231PS
VFMSUBVFMSUB132PD, VFMSUB213PD, VFMSUB231PD, VFMSUB132PS, VFMSUB213PS, VFMSUB231PS, VFMSUB132SD, VFMSUB213SD, VFMSUB231SD, VFMSUB132SS, VFMSUB213SS, VFMSUB231SS
VFNMADDVFNMADD132PD, VFNMADD213PD, VFNMADD231PD, VFNMADD132PS, VFNMADD213PS, VFNMADD231PS, VFNMADD132SD, VFNMADD213SD, VFNMADD231SD, VFNMADD132SS, VFNMADD213SS, VFNMADD231SS
VFNMSUBVFNMSUB132PD, VFNMSUB213PD, VFNMSUB231PD, VFNMSUB132PS, VFNMSUB213PS, VFNMSUB231PS, VFNMSUB132SD, VFNMSUB213SD, VFNMSUB231SD, VFNMSUB132SS, VFNMSUB213SS, VFNMSUB231SS
VGATHERVGATHERDPS, VGATHERDPD, VGATHERQPS, VGATHERQPD
VPGATHERVPGATHERDD, VPGATHERDQ, VPGATHERQD, VPGATHERQQ
VPSRAVVPSRAVW, VPSRAVD, VPSRAVQY

CPUs with AVX-512

Subset | F | CD | ER | PF | 4FMAPS | 4VNNIW | VPOPCNTDQ | VL | DQ | BW | IFMA | VBMI | VNNI | BF16 | VBMI2 | BITALG | VPCLMULQDQ | GFNI | VAES | VP2INTERSECT | FP16
Knights Landing (Xeon Phi x200, 2016)YesYesNo
Knights Mill (Xeon Phi x205, 2017)YesNo
Skylake-SP, Skylake-X (2017)NoNoYesNo
Cannon Lake (2018)YesNo
Cascade Lake (2019)NoYesNo
Cooper Lake (2020)YesNo
Ice Lake (2019)YesNoYesNo
Tiger Lake (2020)YesNo
Rocket Lake (2021)No
Alder Lake (2021)Partial Note 1 Partial Note 1
Zen 4 (2022)YesYesNo
Sapphire Rapids (2023)NoYes
Zen 5 (2024)YesNo

Note 1: Intel does not officially support the AVX-512 family of instructions on Alder Lake microprocessors. Intel has disabled AVX-512 in silicon (fused it off) on recent steppings of Alder Lake microprocessors to prevent customers from enabling it. [33] On older Alder Lake CPUs with some legacy combinations of BIOS and microcode revisions, it was possible to execute AVX-512 instructions after disabling all the efficiency cores, which do not contain the silicon for AVX-512. [34] [35] [22]
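
Because the subsets vary by processor, software usually checks for each one before using it. The sketch below shows one possible way to do this on Linux by parsing the text of `/proc/cpuinfo`. The function name is hypothetical, and the flag spellings (such as `avx512f` or `gfni`) are those used by the Linux kernel, which is an assumption that does not carry over to other operating systems.

```python
# Related extensions whose Linux flag names do not start with "avx512".
EXTRA_FLAGS = {"gfni", "vaes", "vpclmulqdq"}


def avx512_flags(cpuinfo_text: str):
    """Return the sorted AVX-512-related flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return sorted(
                flag for flag in value.split()
                if flag.startswith("avx512") or flag in EXTRA_FLAGS
            )
    return []  # no "flags" line found (for example, on a non-x86 system)
```

On a Linux machine the function could be called as `avx512_flags(open("/proc/cpuinfo").read())`. An empty list means that no AVX-512 subset was reported.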

Performance

Intel Vectorization Advisor (starting from version 2017) supports native AVX-512 performance and vector code quality analysis (for "Core", Xeon and Intel Xeon Phi processors). Along with the traditional hotspots profile, Advisor Recommendations and "seamless" integration of Intel Compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits" such as scatter, compress/expand and mask utilization. [36] [37]

On some processors (mostly Intel processors that predate Ice Lake), AVX-512 instructions can cause frequency throttling even greater than that caused by their predecessors, which penalizes mixed workloads. The additional downclocking is triggered by the 512-bit width of the vectors and depends on the nature of the instructions being executed; using the 128- or 256-bit part of AVX-512 (AVX-512VL) does not trigger it. As a result, GCC and Clang default to preferring 256-bit vectors for Intel targets. [38] [39] [40]

C/C++ compilers also handle loop unrolling and the avoidance of pipeline stalls automatically in order to use AVX-512 effectively. As a result, a programmer who uses intrinsics to force the use of AVX-512 can sometimes get worse performance than the compiler achieves for loops written plainly in the source code. [41] In other cases, using AVX-512 intrinsics in C/C++ code can improve performance relative to plainly written C/C++. [42]

Reception

There are many examples of AVX-512 applications, including media processing, cryptography, video games, [43] neural networks, [44] and even OpenJDK, which employs AVX-512 for sorting. [45]

In a much-cited quote from 2020, Linus Torvalds said "I hope AVX-512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on," [46] stating that he would prefer the transistor budget be spent on additional cores and integer performance instead, and that he "detests" floating point benchmarks. [47]

Numenta touts its "highly sparse" [48] neural network technology, which it says obviates the need for GPUs because its algorithms run on CPUs with AVX-512. [49] The company claims a tenfold speedup relative to the A100, largely because its algorithms reduce the size of the neural network while maintaining accuracy, using techniques such as the Sparse Evolutionary Training (SET) algorithm [50] and Foresight Pruning. [51]

References

  1. James Reinders (23 July 2013). "AVX-512 Instructions". Intel. Retrieved 20 August 2013.
  2. Kusswurm 2022, p. 223.
  3. James Reinders (17 July 2014). "Additional AVX-512 instructions". Intel. Retrieved 3 August 2014.
  4. Anton Shilov. "Intel 'Skylake' processors for PCs will not support AVX-512 instructions". Kitguru.net. Retrieved 2015-03-17.
  5. "Intel will add deep-learning instructions to its processors". 14 October 2016.
  6. "Intel Architecture Instruction Set Extensions Programming Reference" (PDF). Intel. Retrieved 2014-01-29.
  7. "Intel Architecture Instruction Set Extensions and Future Features Programming Reference". Intel. Retrieved 2017-10-16.
  8. "AVX-512 Architecture/Demikhovsky Poster" (PDF). Intel. Retrieved 25 February 2014.
  9. "Intel® Deep Learning Boost" (PDF). Intel. Retrieved 2021-10-11.
  10. "Galois Field New Instructions (GFNI) Technology Guide". networkbuilders.intel.com.
  11. Kivilinna, Jussi (19 April 2023). "camellia-simd-aesni". GitHub. Newer x86-64 processors also support Galois Field New Instructions (GFNI) which allow implementing Camellia s-box more straightforward manner and yield even better performance.
  12. "Intel® AVX512-FP16 Architecture Specification, June 2021, Revision 1.0, Ref. 347407-001US" (PDF). Intel. 2021-06-30. Retrieved 2021-07-04.
  13. "Intel Xeon Phi Processor product brief". Intel. Retrieved 12 October 2016.
  14. "Intel unveils X-series platform: Up to 18 cores and 36 threads, from $242 to $2,000". Ars Technica. Retrieved 2017-05-30.
  15. "Intel Advanced Vector Extensions 2015/2016: Support in GNU Compiler Collection" (PDF). Gcc.gnu.org. Retrieved 2016-10-20.
  16. Patrizio, Andy (21 September 2015). "Intel's Xeon roadmap for 2016 leaks". Itworld.org. Retrieved 2016-10-20.
  17. "Intel Core i9-11900K Review - World's Fastest Gaming Processor?". www.techpowerup.com. 30 March 2021.
  18. ""Add rocketlake to gcc" commit". gcc.gnu.org.
  19. "Intel Celeron 6305 Processor (4M Cache, 1.80 GHz, with IPU) Product Specifications". ark.intel.com. Archived from the original on 2020-10-18. Retrieved 2020-11-10.
  20. Laptop Murah Kinerja Boleh Diadu | HP 14S DQ2518TU, retrieved 2021-08-08
  21. "Using the GNU Compiler Collection (GCC): x86 Options". GNU. Retrieved 2019-10-14.
  22. Cutress, Ian; Frumusanu, Andrei. "The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity". www.anandtech.com. Retrieved 5 November 2021.
  23. Larabel, Michael. "Intel Core i9 12900K "Alder Lake" AVX-512 On Linux". www.phoronix.com. Retrieved 2021-11-08.
  24. Larabel, Michael. "AVX-512 Performance Comparison: AMD Genoa vs. Intel Sapphire Rapids & Ice Lake". www.phoronix.com. Retrieved 2023-01-19.
  25. "The industry's first high-performance x86 SOC with server-class CPUs and integrated AI coprocessor technology". 2 August 2022. Archived from the original on December 12, 2019.
  26. "x86, x64 Instruction Latency, Memory Latency and CPUID dumps (instlatx64)". users.atw.hu.
  27. "AMD Zen 4 Based Ryzen CPUs May Feature Up to 24 Cores, Support for AVX512 Vectors". Hardware Times. 2021-05-23. Retrieved 2021-09-02.
  28. Hagedoorn, Hilbert (18 May 2021). "AMD working on a prodigious 96-core EPYC processor". Guru3D.com. Retrieved 2021-05-25.
  29. clamchowder (2021-08-23). "Details on the Gigabyte Leak". Chips And Cheese. Retrieved 2022-06-10.
  30. W1zzard (26 May 2022). "AMD Answers Our Zen 4 Tech Questions, with Robert Hallock". TechPowerUp. Retrieved 2022-05-29.
  31. Larabel, Michael (2022-09-26). "AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9 7950X". www.phoronix.com.
  32. Larabel, Michael (2024-02-10). "AMD Zen 5 Compiler Support Posted For GCC - Confirms New AVX Features & More". www.phoronix.com.
  33. Alcorn, Paul (2022-03-02). "Intel Nukes Alder Lake's AVX-512 Support, Now Fuses It Off in Silicon". Tom's Hardware. Retrieved 2022-03-07.
  34. Cutress, Ian; Frumusanu, Andrei (2021-08-19). "Intel Architecture Day 2021: Alder Lake, Golden Cove, and Gracemont Detailed". AnandTech. Retrieved 2021-08-25.
  35. Alcorn, Paul (2021-08-19). "Intel Architecture Day 2021: Alder Lake Chips, Golden Cove and Gracemont Cores". Tom's Hardware. Retrieved 2021-08-21.
  36. "Intel Advisor XE 2016 Update 3 What's new - Intel Software". Software.intel.com. Retrieved 2016-10-20.
  37. "Intel Advisor - Intel Software". Software.intel.com. Retrieved 2016-10-20.
  38. Cordes, Peter. "SIMD instructions lowering CPU frequency". Stack Overflow.
  39. Cordes, Peter. "why does gcc auto-vectorization for tigerlake use ymm not zmm registers". Stack Overflow.
  40. "LLVM 10.0.0 Release Notes".
  41. Matthew Kolbe (2023-10-10). Lightning Talk: How to Leverage SIMD Intrinsics for Massive Slowdowns - Matthew Kolbe - CppNow 2023. C++Now. Retrieved 2023-10-15 via YouTube.
  42. Clausecker, Robert (2023-08-05). "Transcoding Unicode Characters with AVX-512 Instructions". arXiv:2212.05098 [cs.DS].
  43. Szewczyk, Chris (2021-11-24). "The RPCS3 PS3 emulator gets a hefty boost on Intel Alder Lake CPUs with AVX-512 enabled". PC Gamer. Retrieved 2023-10-11.
  44. Carneiro, André; Serpa, Matheus (2021-09-05). "Lightweight Deep Learning Applications on AVX-512". 2021 IEEE Symposium on Computers and Communications (ISCC). Athens: IEEE. pp. 1–6. doi:10.1109/ISCC53001.2021.9631464.
  45. Parasa, Srinivas (2023-05-30). "JDK-8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays)". OpenJDK. Retrieved 2023-10-11.
  46. Tung, Liam (2020-07-13). "Linus Torvalds: I hope Intel's AVX-512 dies a painful death". ZDNet. Retrieved 2023-10-11.
  47. Torvalds, Linus (2020-07-11). "Alder Lake and AVX-512". realworldtech.com. Retrieved 2023-10-11.
  48. "Sparsity Enables 100x Performance Acceleration in Deep Learning Networks: A Technology Demonstration" (PDF). numenta.com. 2021-05-20. Retrieved 2023-10-11.
  49. Afifi-Sabet, Keumars (2023-10-08). "A tiny startup has helped Intel trounce AMD and Nvidia in critical AI tests — is it game over already?". TechRadar. Retrieved 2023-10-11.
  50. Mocanu, Decebal; Mocanu, Elena (2018). "Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science". Nature Communications. 9 (1): 2383. Bibcode:2018NatCo...9.2383M. doi:10.1038/s41467-018-04316-3. PMC 6008460. PMID 29921910.
  51. Souza, Lucas (2020-10-30). "The Case for Sparsity in Neural Networks, Part 2: Dynamic Sparsity". numenta.com. Retrieved 2023-10-11.