Bit manipulation instruction sets (BMI sets) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD. The purpose of these instruction sets is to improve the speed of bit manipulation. All the instructions in these sets are non-SIMD and operate only on general-purpose registers.
There are two sets published by Intel: BMI (now referred to as BMI1) and BMI2. Both were introduced with the Haswell microarchitecture, with BMI1 matching features offered by AMD's ABM instruction set and BMI2 extending them. Another two sets were published by AMD: ABM (Advanced Bit Manipulation, which is also a subset of SSE4a implemented by Intel as part of SSE4.2 and BMI1), and TBM (Trailing Bit Manipulation, an extension to BMI1 introduced with Piledriver-based processors, but dropped again in Zen-based processors). [1]
AMD was the first to introduce the instructions that now form Intel's BMI1 as part of its ABM (Advanced Bit Manipulation) instruction set, then later added support for Intel's new BMI2 instructions. AMD today advertises the availability of these features via Intel's BMI1 and BMI2 cpuflags and instructs programmers to target them accordingly. [2]
While Intel considers POPCNT as part of SSE4.2 and LZCNT as part of BMI1, both Intel and AMD advertise the presence of these two instructions individually. POPCNT has a separate CPUID flag of the same name, and Intel and AMD use AMD's ABM flag to indicate LZCNT support (since LZCNT combined with BMI1 and BMI2 completes the expanded ABM instruction set). [2] [3]
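As a rough illustration of how these separate flags might be tested, the sketch below decodes the relevant feature bits from raw CPUID register values. The bit positions are those commonly documented for these features (POPCNT in leaf 1 ECX bit 23, ABM/LZCNT in leaf 0x80000001 ECX bit 5, BMI1 and BMI2 in leaf 7 subleaf 0 EBX bits 3 and 8, TBM in leaf 0x80000001 ECX bit 21). The struct and function names are hypothetical, and actually executing CPUID (for example through a compiler intrinsic) is deliberately left out so the example stays portable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the feature bits discussed above. */
struct bitmanip_features {
    bool popcnt; /* CPUID.01H:ECX[23] */
    bool lzcnt;  /* CPUID.80000001H:ECX[5], the "ABM" flag */
    bool bmi1;   /* CPUID.(EAX=07H,ECX=0):EBX[3] */
    bool bmi2;   /* CPUID.(EAX=07H,ECX=0):EBX[8] */
    bool tbm;    /* CPUID.80000001H:ECX[21] */
};

/* Decode the flags from register values the caller obtained from CPUID. */
static struct bitmanip_features
decode_bitmanip_features(uint32_t leaf1_ecx, uint32_t leaf7_ebx,
                         uint32_t ext1_ecx)
{
    struct bitmanip_features f;
    f.popcnt = ((leaf1_ecx >> 23) & 1u) != 0;
    f.lzcnt  = ((ext1_ecx  >>  5) & 1u) != 0;
    f.bmi1   = ((leaf7_ebx >>  3) & 1u) != 0;
    f.bmi2   = ((leaf7_ebx >>  8) & 1u) != 0;
    f.tbm    = ((ext1_ecx  >> 21) & 1u) != 0;
    return f;
}
```

Keeping each flag separate mirrors the point made above: LZCNT support is read from the ABM bit, not inferred from the BMI1 bit.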
Encoding | Instruction | Description [4] |
---|---|---|
F3 0F B8 /r | POPCNT | Population count |
F3 0F BD /r | LZCNT | Leading zeros count |
LZCNT is related to the Bit Scan Reverse (BSR) instruction, but sets the ZF flag if the result is zero and the CF flag if the source is zero, whereas BSR sets ZF if the source is zero. It also produces a defined result (the source operand size in bits) if the source operand is zero. For a non-zero argument, the LZCNT and BSR results sum to the argument bit width minus 1 (for example, if the 32-bit argument is 0x000f0000, LZCNT gives 12 and BSR gives 19).
The encoding of LZCNT is such that if ABM is not supported, then the BSR instruction is executed instead. [4] : 227
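The relationship between the two instructions can be checked with a small portable model. This is a minimal sketch with hypothetical function names, using plain loops instead of the real instructions or compiler intrinsics; flag behaviour is not modelled.

```c
#include <stdint.h>

/* Portable model of 32-bit LZCNT: number of leading zero bits,
 * with the defined result 32 for a zero source. */
static unsigned lzcnt32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Portable model of 32-bit BSR for a non-zero source: index of the
 * highest set bit. The real instruction leaves its destination
 * undefined for a zero source; this model happens to return 0. */
static unsigned bsr32(uint32_t x)
{
    unsigned i = 0;
    while (x >>= 1)
        i++;
    return i;
}
```

For the example value 0x000f0000, `lzcnt32` returns 12 and `bsr32` returns 19, which sum to 31. The same numbers show why the fallback encoding matters: code that assumes LZCNT but runs on a processor without ABM would receive the BSR answer (19) rather than 12.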
The instructions below are those enabled by the BMI bit in CPUID. Intel officially considers LZCNT as part of BMI, but advertises LZCNT support using the ABM CPUID feature flag. [3] BMI1 is available in AMD's Jaguar, [5] Piledriver [6] and newer processors, and in Intel's Haswell [7] and newer processors.
Encoding | Instruction | Description [3] | Equivalent C expression [8] [9] [10] |
---|---|---|---|
VEX.LZ.0F38 F2 /r | ANDN | Logical and not | ~x & y |
VEX.LZ.0F38 F7 /r | BEXTR | Bit field extract (with register) | (src >> start) & ((1 << len) - 1) |
VEX.LZ.0F38 F3 /3 | BLSI | Extract lowest set isolated bit | x & -x |
VEX.LZ.0F38 F3 /2 | BLSMSK | Get mask up to lowest set bit | x ^ (x - 1) |
VEX.LZ.0F38 F3 /1 | BLSR | Reset lowest set bit | x & (x - 1) |
F3 0F BC /r | TZCNT | Count the number of trailing zero bits | 31+(!x)-(((x&-x)&0x0000FFFF)?16:0)-(((x&-x)&0x00FF00FF)?8:0)-(((x&-x)&0x0F0F0F0F)?4:0)-(((x&-x)&0x33333333)?2:0)-(((x&-x)&0x55555555)?1:0) |
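The C expressions in the table can be wrapped in small functions to experiment with them. The following is a sketch for 32-bit operands with hypothetical function names; it models only the result value, not the flags, and the guards in `bextr32` are an assumption added to avoid undefined shifts in C rather than a statement about the instruction's encoding.

```c
#include <stdint.h>

/* ANDN: bits set in y that are clear in x. */
static uint32_t andn32(uint32_t x, uint32_t y) { return ~x & y; }

/* BEXTR: extract len bits starting at bit position start. */
static uint32_t bextr32(uint32_t src, unsigned start, unsigned len)
{
    if (start >= 32 || len == 0)
        return 0;
    src >>= start;
    return len >= 32 ? src : src & ((1u << len) - 1u);
}

/* BLSI: isolate the lowest set bit (0u - x is the unsigned negation). */
static uint32_t blsi32(uint32_t x) { return x & (0u - x); }

/* BLSMSK: mask of all bits up to and including the lowest set bit. */
static uint32_t blsmsk32(uint32_t x) { return x ^ (x - 1u); }

/* BLSR: clear the lowest set bit. */
static uint32_t blsr32(uint32_t x) { return x & (x - 1u); }
```

With x = 0x58 (binary 0101 1000), `blsi32` gives 0x08, `blsmsk32` gives 0x0F and `blsr32` gives 0x50.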
TZCNT is almost identical to the Bit Scan Forward (BSF) instruction, but sets the ZF flag if the result is zero and the CF flag if the source is zero, whereas BSF sets ZF if the source is zero. For a non-zero argument, TZCNT and BSF give the same result.
As with LZCNT, the encoding of TZCNT is such that if BMI1 is not supported, then the BSF instruction is executed instead. [4] : 352
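As a cross-check of the TZCNT row in the table above, the sketch below compares a straightforward loop with the branch-free expression listed there. Both functions are illustrative models with hypothetical names, limited to 32-bit operands and to the result value only.

```c
#include <stdint.h>

/* Loop-based model of 32-bit TZCNT: number of trailing zero bits,
 * with the defined result 32 for a zero source. */
static unsigned tzcnt32_loop(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* The expression from the table: isolate the lowest set bit, then
 * subtract 16, 8, 4, 2 and 1 depending on which half, quarter, etc.
 * of the word that bit falls in. */
static unsigned tzcnt32_expr(uint32_t x)
{
    uint32_t low = x & (0u - x);
    return 31u + !x
         - ((low & 0x0000FFFFu) ? 16u : 0u)
         - ((low & 0x00FF00FFu) ?  8u : 0u)
         - ((low & 0x0F0F0F0Fu) ?  4u : 0u)
         - ((low & 0x33333333u) ?  2u : 0u)
         - ((low & 0x55555555u) ?  1u : 0u);
}
```

Both models return 32 for a zero input, which is the case where TZCNT and BSF differ.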
Intel introduced BMI2 together with BMI1 in its line of Haswell processors. Only AMD has produced processors supporting BMI1 without BMI2; BMI2 is supported by AMD's Excavator architecture and newer. [11]
Encoding | Instruction | Description |
---|---|---|
VEX.LZ.0F38 F5 /r | BZHI | Zero high bits starting with specified bit position: src & ((1 << inx) - 1) |
VEX.LZ.F2.0F38 F6 /r | MULX | Unsigned multiply without affecting flags, and arbitrary destination registers |
VEX.LZ.F2.0F38 F5 /r | PDEP | Parallel bits deposit |
VEX.LZ.F3.0F38 F5 /r | PEXT | Parallel bits extract |
VEX.LZ.F2.0F3A F0 /r ib | RORX | Rotate right logical without affecting flags |
VEX.LZ.F3.0F38 F7 /r | SARX | Shift arithmetic right without affecting flags |
VEX.LZ.F2.0F38 F7 /r | SHRX | Shift logical right without affecting flags |
VEX.LZ.66.0F38 F7 /r | SHLX | Shift logical left without affecting flags |
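A few of these operations are easy to model in portable C. The sketch below uses hypothetical function names and 32-bit operands. It models only the computed values; the "without affecting flags" property has no counterpart in C, and treating an index of 32 or more in `bzhi32` as "leave the source unchanged" reflects the documented behaviour as understood here rather than anything stated in the table.

```c
#include <stdint.h>

/* BZHI: keep the low inx bits of src and zero the rest. */
static uint32_t bzhi32(uint32_t src, unsigned inx)
{
    return inx >= 32 ? src : src & ((1u << inx) - 1u);
}

/* RORX: rotate right by n bit positions (n taken modulo 32). */
static uint32_t rorx32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* MULX: full unsigned 32x32 -> 64-bit product, returned as two
 * halves; the high half is the return value, the low half goes
 * to *lo. */
static uint32_t mulx32(uint32_t a, uint32_t b, uint32_t *lo)
{
    uint64_t p = (uint64_t)a * b;
    *lo = (uint32_t)p;
    return (uint32_t)(p >> 32);
}
```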
The PDEP and PEXT instructions are new generalized bit-level compress and expand instructions. They take two inputs; one is a source, and the other is a selector. The selector is a bitmap selecting the bits that are to be packed or unpacked. PEXT copies selected bits from the source to contiguous low-order bits of the destination; higher-order destination bits are cleared. PDEP does the opposite for the selected bits: contiguous low-order bits are copied to selected bits of the destination; other destination bits are cleared. This can be used to extract any bitfield of the input, and even do a lot of bit-level shuffling that previously would have been expensive. While what these instructions do is similar to bit-level gather-scatter SIMD instructions, PDEP and PEXT (like the rest of the BMI instruction sets) operate on general-purpose registers. [12]
The instructions are available in 32-bit and 64-bit versions. An example using arbitrary source and selector in 32-bit mode is:
Instruction | Selector mask | Source | Destination |
---|---|---|---|
PEXT | 0xff00fff0 | 0x12345678 | 0x00012567 |
PDEP | 0xff00fff0 | 0x00012567 | 0x12005670 |
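The behaviour described above can be reproduced with a simple bit-by-bit loop. This is a reference sketch with hypothetical function names, written for clarity rather than speed; it is not an optimized software replacement for the instructions.

```c
#include <stdint.h>

/* Model of 32-bit PEXT: gather the source bits selected by mask
 * into contiguous low-order bits of the result. */
static uint32_t pext32(uint32_t src, uint32_t mask)
{
    uint32_t dst = 0;
    uint32_t out_bit = 1;               /* next low-order result bit */
    while (mask != 0) {
        uint32_t lowest = mask & (0u - mask); /* lowest selected bit */
        if (src & lowest)
            dst |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1u;              /* drop that selector bit */
    }
    return dst;
}

/* Model of 32-bit PDEP: scatter contiguous low-order source bits
 * to the positions selected by mask. */
static uint32_t pdep32(uint32_t src, uint32_t mask)
{
    uint32_t dst = 0;
    uint32_t in_bit = 1;                /* next low-order source bit */
    while (mask != 0) {
        uint32_t lowest = mask & (0u - mask);
        if (src & in_bit)
            dst |= lowest;
        in_bit <<= 1;
        mask &= mask - 1u;
    }
    return dst;
}
```

With the values from the table, `pext32(0x12345678, 0xff00fff0)` yields 0x00012567 and `pdep32(0x00012567, 0xff00fff0)` yields 0x12005670. The unselected bits of the original source are lost, so PDEP restores only the selected positions.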
AMD processors before Zen 3 [13] that implement PDEP and PEXT do so in microcode, with a latency of 18 cycles [14] rather than the 3 cycles of Zen 3. [15] As a result it is often faster to use other instructions on these processors. [16]
TBM consists of instructions complementary to the instruction set started by BMI1; their complementary nature means they do not necessarily need to be used directly but can be generated by an optimizing compiler when supported. AMD introduced TBM together with BMI1 in its Piledriver [6] line of processors; later AMD Jaguar and Zen-based processors do not support TBM. [5] No Intel processors (at least through Alder Lake) support TBM.
Encoding | Instruction | Description [4] | Equivalent C expression [17] [9] |
---|---|---|---|
XOP.LZ.0A 10 /r id | BEXTR | Bit field extract (with immediate) | (src >> start) & ((1 << len) - 1) |
XOP.LZ.09 01 /1 | BLCFILL | Fill from lowest clear bit | x & (x + 1) |
XOP.LZ.09 02 /6 | BLCI | Isolate lowest clear bit | x | ~(x + 1) |
XOP.LZ.09 01 /5 | BLCIC | Isolate lowest clear bit and complement | ~x & (x + 1) |
XOP.LZ.09 02 /1 | BLCMSK | Mask from lowest clear bit | x ^ (x + 1) |
XOP.LZ.09 01 /3 | BLCS | Set lowest clear bit | x | (x + 1) |
XOP.LZ.09 01 /2 | BLSFILL | Fill from lowest set bit | x | (x - 1) |
XOP.LZ.09 01 /6 | BLSIC | Isolate lowest set bit and complement | ~x | (x - 1) |
XOP.LZ.09 01 /7 | T1MSKC | Inverse mask from trailing ones | ~x | (x + 1) |
XOP.LZ.09 01 /4 | TZMSK | Mask from trailing zeros | ~x & (x - 1) |
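Because TBM is absent from most processors, the C expressions in the table are what portable code would normally use. The sketch below wraps them in functions for 32-bit operands; the function names are hypothetical and only the result values are modelled.

```c
#include <stdint.h>

/* Operations on the lowest clear bit. */
static uint32_t blcfill32(uint32_t x) { return x & (x + 1u); }
static uint32_t blci32(uint32_t x)    { return x | ~(x + 1u); }
static uint32_t blcic32(uint32_t x)   { return ~x & (x + 1u); }
static uint32_t blcmsk32(uint32_t x)  { return x ^ (x + 1u); }
static uint32_t blcs32(uint32_t x)    { return x | (x + 1u); }

/* Operations on the lowest set bit. */
static uint32_t blsfill32(uint32_t x) { return x | (x - 1u); }
static uint32_t blsic32(uint32_t x)   { return ~x | (x - 1u); }

/* Masks derived from trailing ones or trailing zeros. */
static uint32_t t1mskc32(uint32_t x)  { return ~x | (x + 1u); }
static uint32_t tzmsk32(uint32_t x)   { return ~x & (x - 1u); }
```

For x = 0x57 (binary 0101 0111, lowest clear bit at position 3), `blcs32` gives 0x5F and `blcic32` gives 0x08. For x = 0x58 (binary 0101 1000), `tzmsk32` gives 0x07.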
Note that instruction extension support means only that the processor is capable of executing the supported instructions for software compatibility purposes; it might not perform well doing so. For example, Excavator through Zen 2 processors implement the PEXT and PDEP instructions using microcode, so the instructions execute significantly slower than the same behaviour recreated using other instructions. [20] (A software method called "zp7" is, in fact, faster on these machines.) [21] For optimum performance it is recommended that compiler developers choose individual instructions from the extensions based on architecture-specific performance profiles rather than on extension availability.