Find first set

Last updated

In computer software and hardware, find first set (ffs) or find first one is a bit operation that, given an unsigned machine word, [nb 1] designates the index or position of the least significant bit set to one in the word counting from the least significant bit position. A nearly equivalent operation is count trailing zeros (ctz) or number of trailing zeros (ntz), which counts the number of zero bits following the least significant one bit. The complementary operation that finds the index or position of the most significant set bit is log base 2, so called because it computes the binary logarithm ⌊log2(x)⌋. [1] This is closely related to count leading zeros (clz) or number of leading zeros (nlz), which counts the number of zero bits preceding the most significant one bit. [nb 2] There are two common variants of find first set, the POSIX definition which starts indexing of bits at 1, [2] herein labelled ffs, and the variant which starts indexing of bits at zero, which is equivalent to ctz and so will be called by that name.

Contents

Most modern CPU instruction set architectures provide one or more of these as hardware operators; software emulation is usually provided for any that aren't available, either as compiler intrinsics or in system libraries.

Examples

Given the following 32-bit word:

0000 0000 0000 0000 1000 0000 0000 1000

The count trailing zeros operation would return 3, while the count leading zeros operation returns 16. The count leading zeros operation depends on the word size: if this 32-bit word were truncated to a 16-bit word, count leading zeros would return zero. The find first set operation would return 4, indicating the 4th position from the right. The truncated log base 2 is 15.

Similarly, given the following 32-bit word, the bitwise negation of the above word:

1111 1111 1111 1111 0111 1111 1111 0111

The count trailing ones operation would return 3, the count leading ones operation would return 16, and the find first zero operation ffz would return 4.

If the word is zero (no bits set), count leading zeros and count trailing zeros both return the number of bits in the word, while ffs returns zero. Both log base 2 and zero-based implementations of find first set generally return an undefined result for the zero word.

Hardware support

Many architectures include instructions to rapidly perform find first set and/or related operations, listed below. The most common operation is count leading zeros (clz), likely because all other operations can be implemented efficiently in terms of it (see Properties and relations).

PlatformMnemonicNameOperand widthsDescriptionOn application to 0
ARM (ARMv5T architecture and later)
except Cortex-M0/M0+/M1/M23
clz [3] Count Leading Zeros32clz32
ARM (ARMv8-A architecture)clzCount Leading Zeros32, 64clzOperand width
AVR32 clz [4] Count Leading Zeros32clz32
DEC Alpha ctlz [5] Count Leading Zeros64clz64
cttz [5] Count Trailing Zeros64ctz64
Intel 80386 and laterbsf [6] Bit Scan Forward16, 32, 64ctzUndefined; sets zero flag
bsr [6] Bit Scan Reverse16, 32, 64Log base 2Undefined; sets zero flag
x86 supporting BMI1 or ABM lzcnt [7] Count Leading Zeros16, 32, 64clzOperand width; sets carry flag
x86 supporting BMI1 tzcnt [8] Count Trailing Zeros16, 32, 64ctzOperand width; sets carry flag
Itanium clz [9] Count Leading Zeros64clz64
MIPS32/MIPS64 clz [10] [11] Count Leading Zeros in Word32, 64clzOperand width
clo [10] [11] Count Leading Ones in Word32, 64cloOperand width
Motorola 68020 and laterbfffo [12] Find First One in Bit FieldArbitraryLog base 2Field offset + field width
PDP-10 jffoJump if Find First One36clz0; no operation
POWER/PowerPC/Power ISA cntlz/cntlzw/cntlzd [13] Count Leading Zeros32, 64clzOperand width
Power ISA 3.0 and later cnttzw/cnttzd [14] Count Trailing Zeros32, 64ctzOperand width
RISC-V ("B" Extension)clz [15] Count Leading Zeros32, 64clzOperand width
ctz [15] Count Trailing Zeros32, 64ctzOperand width
SPARC Oracle Architecture 2011 and later lzcnt (synonym: lzd) [16] Leading Zero Count64clz64
VAX ffs [17] Find First Set0–32ctzOperand width; sets zero flag
IBM z/Architecture flogr [18] Find Leftmost One64clz64
vclz [18] Vector Count Leading Zeroes8, 16, 32, 64clzOperand width
vctz [18] Vector Count Trailing Zeroes8, 16, 32, 64ctzOperand width

On some Alpha platforms CTLZ and CTTZ are emulated in software.

Tool and library support

A number of compiler and library vendors supply compiler intrinsics or library functions to perform find first set and/or related operations, which are frequently implemented in terms of the hardware instructions above:

Tool/libraryNameTypeInput type(s)NotesOn application to 0
POSIX.1 compliant libc
4.3BSD libc
OS X 10.3 libc [2] [19]
ffsLibrary functionintIncludes glibc. POSIX does not supply the complementary log base 2 / clz.0
FreeBSD 5.3 libc
OS X 10.4 libc [19]
ffsl
fls
flsl
Library functionint,
long
fls("find last set") computes (log base 2) + 1.0
FreeBSD 7.1 libc [20] ffsll
flsll
Library functionlong long0
GCC 3.4.0 [21] [22]
Clang 5.x [23] [24]
__builtin_ffs[l,ll,imax]
__builtin_clz[l,ll,imax]
__builtin_ctz[l,ll,imax]
Built-in functionsunsigned int,
unsigned long,
unsigned long long,
uintmax_t
GCC documentation considers result undefined clz and ctz on 0.0 (ffs)
Visual Studio 2005_BitScanForward [25]
_BitScanReverse [26]
Compiler intrinsicsunsigned long,
unsigned __int64
Separate return value to indicate zero inputUndefined
Visual Studio 2008__lzcnt [27] Compiler intrinsicunsigned short,
unsigned int,
unsigned __int64
Relies on hardware support for the lzcnt instruction introduced in BMI1 or ABM.Operand width
Visual Studio 2012_arm_clz [28] Compiler intrinsicunsigned intRelies on hardware support for the clz instruction introduced in the ARMv5T architecture and later.?
Intel C++ Compiler _bit_scan_forward
_bit_scan_reverse [29] [30]
Compiler intrinsicsintUndefined
Nvidia CUDA [31] __clzFunctions32-bit, 64-bitCompiles to fewer instructions on the GeForce 400 series 32
__ffs0
LLVM llvm.ctlz.*
llvm.cttz.* [32]
Intrinsic8, 16, 32, 64, 256LLVM assembly languageOperand width, if 2nd argument is 0; undefined otherwise
GHC 7.10 (base 4.8), in Data.Bits[ citation needed ]countLeadingZeros
countTrailingZeros
Library functionFiniteBits b => bHaskell programming languageOperand width
C++20 standard library, in header <bit> [33] [34] bit_ceil bit_floor
bit_width
countl_zero countl_one
countr_zero countr_one
Library functionunsigned char,
unsigned short,
unsigned int,
unsigned long,
unsigned long long

Properties and relations

If bits are labeled starting at 1 (which is the convention used in this article), then count trailing zeros and find first set operations are related by ctz(x) = ffs(x) − 1 (except when the input is zero). If bits are labeled starting at 0, then count trailing zeros and find first set are exactly equivalent operations. Given w bits per word, the log2 is easily computed from the clz and vice versa by log2(x) = w − 1 − clz(x).

As demonstrated in the example above, the find first zero, count leading ones, and count trailing ones operations can be implemented by negating the input and using find first set, count leading zeros, and count trailing zeros. The reverse is also true.

On platforms with an efficient log2 operation such as M68000, ctz can be computed by:

ctz(x) = log2(x & −x)

where & denotes bitwise AND and −x denotes the two's complement of x. The expression x & −x clears all but the least-significant 1 bit, so that the most- and least-significant 1 bit are the same.

On platforms with an efficient count leading zeros operation such as ARM and PowerPC, ffs can be computed by:

ffs(x) = w − clz(x & −x).

Conversely, on machines without log2 or clz operators, clz can be computed using ctz, albeit inefficiently:

clz = w − ctz(2⌈log2(x)⌉) (which depends on ctz returning w for the zero input)

On platforms with an efficient Hamming weight (population count) operation such as SPARC's POPC [35] [36] or Blackfin's ONES, [37] there is:

ctz(x) = popcount((x & −x) − 1), [38] [39] or ctz(x) = popcount(~(x | −x)),
ffs(x) = popcount(x ^ ~−x) [35]
clz = 32 − popcount(2⌈log2(x)⌉ − 1)

where ^ denotes bitwise exclusive-OR, | denotes bitwise OR and ~ denotes bitwise negation.

The inverse problem (given i, produce an x such that ctz(x) = i) can be computed with a left-shift (1 << i).

Find first set and related operations can be extended to arbitrarily large bit arrays in a straightforward manner by starting at one end and proceeding until a word that is not all-zero (for ffs, ctz, clz) or not all-one (for ffz, clo, cto) is encountered. A tree data structure that recursively uses bitmaps to track which words are nonzero can accelerate this.

Software emulation

Most CPUs dating from the late 1980s onward have bit operators for ffs or equivalent, but a few modern ones like some of the ARM-Mx series do not. In lieu of hardware operators for ffs, clz and ctz, software can emulate them with shifts, integer arithmetic and bitwise operators. There are several approaches depending on architecture of the CPU and to a lesser extent, the programming language semantics and compiler code generation quality. The approaches may be loosely described as linear search, binary search, search+table lookup, de Bruijn multiplication, floating point conversion/exponent extract, and bit operator (branchless) methods. There are tradeoffs between execution time and storage space as well as portability and efficiency.

Software emulations are usually deterministic. They return a defined result for all input values; in particular, the result for an input of all zero bits is usually 0 for ffs, and the bit length of the operand for the other operations.

If one has a hardware clz or equivalent, ctz can be efficiently computed with bit operations, but the converse is not true: clz is not efficient to compute in the absence of a hardware operator.

2n

The function 2⌈log2(x)⌉ (round up to the nearest power of two) using shifts and bitwise ORs [40] is not efficient to compute as in this 32-bit example and even more inefficient if we have a 64-bit or 128-bit operand:

function pow2(x):     if x = 0 return invalid  // invalid is implementation defined (not in [0,63])     x ← x - 1     for each y in {1, 2, 4, 8, 16}: x ← x | (x >> y)     return x + 1

FFS

Since ffs = ctz + 1 (POSIX) or ffs = ctz (other implementations), the applicable algorithms for ctz may be used, with a possible final step of adding 1 to the result, and returning 0 instead of the operand length for input of all zero bits.

CTZ

The canonical algorithm is a loop counting zeros starting at the LSB until a 1-bit is encountered:

function ctz1 (x)     if x = 0 return w     t ← 1     r ← 0     while (x & t) = 0         t ← t << 1         r ← r + 1     return r

This algorithm executes O(n) time and operations, and is impractical in practice due to a large number of conditional branches.

A lookup table can eliminate most branches:

table[0..2n-1] = ctz(i) for i in 0..2n-1 function ctz2 (x)     if x = 0 return w     r ← 0     loopif (x & (2n-1)) ≠ 0             return r + table[x & (2n-1)]         x ← x >> n         r ← r + n

The parameter n is fixed (typically 8) and represents a time–space tradeoff. The loop may also be fully unrolled. But as a linear lookup, this approach is still O(n) in the number of bits in the operand.

A binary search implementation takes a logarithmic number of operations and branches, as in this 32-bit version: [41] [42] This algorithm can be assisted by a table as well, replacing the bottom three "if" statements with a 256 entry lookup table using the first non-zero byte encountered as an index.

function ctz3 (x)     if x = 0 return 32     n ← 0     if (x & 0x0000FFFF) = 0: n ← n + 16, x ← x >> 16     if (x & 0x000000FF) = 0: n ← n +  8, x ← x >>  8     if (x & 0x0000000F) = 0: n ← n +  4, x ← x >>  4     if (x & 0x00000003) = 0: n ← n +  2, x ← x >>  2     if (x & 0x00000001) = 0: n ← n +  1     return n

If the hardware has a clz operator, the most efficient approach to computing ctz is thus:

function ctz4 (x)     x &= -x     return w - (clz(x) + 1)

An algorithm for 32-bit ctz uses de Bruijn sequences to construct a minimal perfect hash function that eliminates all branches. [43] [44] This algorithm assumes that the result of the multiplication is truncated to 32 bit.

for i from 0 to 31: table[ ( 0x077CB531 * ( 1 << i ) ) >> 27 ] ← i  // table [0..31] initialized function ctz5 (x)     return table[((x & -x) * 0x077CB531) >> 27]

The expression (x & -x) again isolates the least-significant 1 bit. There are then only 32 possible words, which the unsigned multiplication and shift hash to the correct position in the table. (This algorithm does not handle the zero input.)

CLZ

The canonical algorithm examines one bit at a time starting from the MSB until a non-zero bit is found, as shown in this example. It executes in O(n) time where n is the bit-length of the operand, and is not a practical algorithm for general use.

function clz1 (x)     if x = 0 return w     t ← 1 << (w - 1)     r ← 0     while (x & t) = 0         t ← t >> 1         r ← r + 1     return r

An improvement on the previous looping approach examines eight bits at a time then uses a 256 (28) entry lookup table for the first non-zero byte. This approach, however, is still O(n) in execution time.

function clz2 (x)     if x = 0 return w     t ← 0xff << (w - 8)     r ← 0     while (x & t) = 0         t ← t >> 8         r ← r + 8     return r + table[x >> (w - 8 - r)]

Binary search can reduce execution time to O(log2n):

function clz3 (x)     if x = 0 return 32     n ← 0     if (x & 0xFFFF0000) = 0: n ← n + 16, x ← x << 16     if (x & 0xFF000000) = 0: n ← n +  8, x ← x <<  8     if (x & 0xF0000000) = 0: n ← n +  4, x ← x <<  4     if (x & 0xC0000000) = 0: n ← n +  2, x ← x <<  2     if (x & 0x80000000) = 0: n ← n +  1     return n

The fastest portable approaches to simulate clz are a combination of binary search and table lookup: an 8-bit table lookup (28=256 1-byte entries) can replace the bottom 3 branches in binary search. 64-bit operands require an additional branch. A larger width lookup can be used but the maximum practical table size is limited by the size of L1 data cache on modern processors, which is 32 KB for many. Saving a branch is more than offset by the latency of an L1 cache miss.

An algorithm similar to de Bruijn multiplication for CTZ works for CLZ, but rather than isolating the most-significant bit, it rounds up to the nearest integer of the form 2n1 using shifts and bitwise ORs: [45]

table[0..31] = {0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,                 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31} function clz4 (x)     for each y in {1, 2, 4, 8, 16}: x ← x | (x >> y)     return table[((x * 0x07C4ACDD) >> 27) % 32]

For processors with deep pipelines, like Prescott and later Intel processors, it may be faster to replace branches by bitwise AND and OR operators (even though many more instructions are required) to avoid pipeline flushes for mispredicted branches (and these types of branches are inherently unpredictable):

function clz5 (x)     r = (x > 0xFFFF) << 4; x >>= r;     q = (x > 0xFF  ) << 3; x >>= q; r |= q;     q = (x > 0xF   ) << 2; x >>= q; r |= q;     q = (x > 0x3   ) << 1; x >>= q; r |= q;                                     r |= (x >> 1);     return r;

On platforms that provide hardware conversion of integers to floating point, the exponent field can be extracted and subtracted from a constant to compute the count of leading zeros. Corrections are needed to account for rounding errors. [41] [46] Floating point conversion can have substantial latency. This method is highly non-portable and not usually recommended.

intx;intr;union{unsignedintu[2];doubled;}t;t.u[LE]=0x43300000;// LE is 1 for little-endiant.u[!LE]=x;t.d-=4503599627370496.0;r=(t.u[LE]>>20)-0x3FF;// log2r++;// CLZ

Applications

The count leading zeros (clz) operation can be used to efficiently implement normalization, which encodes an integer as m × 2e, where m has its most significant bit in a known position (such as the highest position). This can in turn be used to implement Newton–Raphson division, perform integer to floating point conversion in software, and other applications. [41] [47]

Count leading zeros (clz) can be used to compute the 32-bit predicate "x = y" (zero if true, one if false) via the identity clz(x − y) >> 5, where ">>" is unsigned right shift. [48] It can be used to perform more sophisticated bit operations like finding the first string of n 1 bits. [49] The expression clz(x − y)1 << (16  clz(x  1)/2) is an effective initial guess for computing the square root of a 32-bit integer using Newton's method. [50] CLZ can efficiently implement null suppression, a fast data compression technique that encodes an integer as the number of leading zero bytes together with the nonzero bytes. [51] It can also efficiently generate exponentially distributed integers by taking the clz of uniformly random integers. [41]

The log base 2 can be used to anticipate whether a multiplication will overflow, since ⌈log2(xy)⌉ ≤ ⌈log2(x)⌉ + ⌈log2(y)⌉. [52]

Count leading zeros and count trailing zeros can be used together to implement Gosper's loop-detection algorithm, [53] which can find the period of a function of finite range using limited resources. [42]

The binary GCD algorithm spends many cycles removing trailing zeros; this can be replaced by a count trailing zeros (ctz) followed by a shift. A similar loop appears in computations of the hailstone sequence.

A bit array can be used to implement a priority queue. In this context, find first set (ffs) is useful in implementing the "pop" or "pull highest priority element" operation efficiently. The Linux kernel real-time scheduler internally uses sched_find_first_bit() for this purpose. [54]

The count trailing zeros operation gives a simple optimal solution to the Tower of Hanoi problem: the disks are numbered from zero, and at move k, disk number ctz(k) is moved the minimum possible distance to the right (circling back around to the left as needed). It can also generate a Gray code by taking an arbitrary word and flipping bit ctz(k) at step k. [42]

See also

Notes

  1. Using bit operations on other than an unsigned machine word may yield undefined results.
  2. These four operations also have (much less common) negated versions:
    • find first zero (ffz), which identifies the index of the least significant zero bit;
    • count trailing ones, which counts the number of one bits following the least significant zero bit.
    • count leading ones, which counts the number of one bits preceding the most significant zero bit;
    • find the index of the most significant zero bit, which is an inverted version of the binary logarithm.

Related Research Articles

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

<span class="mw-page-title-main">Hamming distance</span> Number of bits that differ between two strings

In information theory, the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently, the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences. It is named after the American mathematician Richard Hamming.

<span class="mw-page-title-main">XOR swap algorithm</span> Binary arithmetic algorithm

In computer programming, the exclusive or swap is an algorithm that uses the exclusive or bitwise operation to swap the values of two variables without using the temporary variable which is normally required.

x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.

<span class="mw-page-title-main">Binary logarithm</span> Exponent of a power of two

In mathematics, the binary logarithm is the power to which the number 2 must be raised to obtain the value n. That is, for any real number x,

In computer programming, a bitwise operation operates on a bit string, a bit array or a binary numeral at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic operations and directly supported by the processor. Most bitwise operations are presented as two-operand instructions where the result replaces one of the input operands.

Two's complement is the most common method of representing signed integers on computers, and more generally, fixed point binary values. Two's complement uses the binary digit with the greatest place value as the sign to indicate whether the binary number is positive or negative. When the most significant bit is 1, the number is signed as negative; and when the most significant bit is 0 the number is signed as positive.

The AKS primality test is a deterministic primality-proving algorithm created and published by Manindra Agrawal, Neeraj Kayal, and Nitin Saxena, computer scientists at the Indian Institute of Technology Kanpur, on August 6, 2002, in an article titled "PRIMES is in P". The algorithm was the first one which is able to determine in polynomial time, whether a given number is prime or composite and this without relying on mathematical conjectures such as the generalized Riemann hypothesis. The proof is also notable for not relying on the field of analysis. In 2006 the authors received both the Gödel Prize and Fulkerson Prize for their work.

In modular arithmetic computation, Montgomery modular multiplication, more commonly referred to as Montgomery multiplication, is a method for performing fast modular multiplication. It was introduced in 1985 by the American mathematician Peter L. Montgomery.

In number theory, the integer square root (isqrt) of a non-negative integer n is the non-negative integer m which is the greatest integer less than or equal to the square root of n,

In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers on a finite universe, where each of the input integers has size less than 2w and is non-negative. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(logwn) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by using certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.

The Hamming weight of a string is the number of symbols that are different from the zero-symbol of the alphabet used. It is thus equivalent to the Hamming distance from the all-zero string of the same length. For the most typical case, a string of bits, this is the number of 1's in the string, or the digit sum of the binary representation of a given number and the ₁ norm of a bit vector. In this binary case, it is also called the population count, popcount, sideways sum, or bit summation.

A bit array is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation.

Bit manipulation is the act of algorithmically manipulating bits or other pieces of data shorter than a word. Computer programming tasks that require bit manipulation include low-level device control, error detection and correction algorithms, data compression, encryption algorithms, and optimization. For most other tasks, modern programming languages allow the programmer to work directly with abstractions instead of bits that represent those abstractions.

A division algorithm is an algorithm which, given two integers N and D, computes their quotient and/or remainder, the result of Euclidean division. Some are applied by hand, while others are employed by digital circuit designs and software.

Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.

A counter machine is an abstract machine used in a formal logic and theoretical computer science to model computation. It is the most primitive of the four types of register machines. A counter machine comprises a set of one or more unbounded registers, each of which can hold a single non-negative integer, and a list of arithmetic and control instructions for the machine to follow. The counter machine is typically used in the process of designing parallel algorithms in relation to the mutual exclusion principle. When used in this manner, the counter machine is used to model the discrete time-steps of a computational system in relation to memory accesses. By modeling computations in relation to the memory accesses for each respective computational step, parallel algorithms may be designed in such a matter to avoid interlocking, the simultaneous writing operation by two threads to the same memory address.

Dynamic Markov compression (DMC) is a lossless data compression algorithm developed by Gordon Cormack and Nigel Horspool. It uses predictive arithmetic coding similar to prediction by partial matching (PPM), except that the input is predicted one bit at a time. DMC has a good compression ratio and moderate speed, similar to PPM, but requires somewhat more memory and is not widely implemented. Some recent implementations include the experimental compression programs hook by Nania Francesco Antonio, ocamyd by Frank Schwellinger, and as a submodel in paq8l by Matt Mahoney. These are based on the 1993 implementation in C by Gordon Cormack.

<span class="mw-page-title-main">Arithmetic logic unit</span> Combinational digital circuit

In computing, an arithmetic logic unit (ALU) is a combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating point numbers. It is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs).

AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200, and then later in a number of AMD and other Intel CPUs. AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F is required by all AVX-512 implementations.

References

  1. Anderson. Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way).
  2. 1 2 "FFS(3)". Linux Programmer's Manual. The Linux Kernel Archives. Retrieved 2012-01-02.
  3. "ARM Instruction Reference > ARM general data processing instructions > CLZ". ARM Developer Suite Assembler Guide. ARM . Retrieved 2012-01-03.
  4. "AVR32 Architecture Document" (PDF) (CORP072610 ed.). Atmel Corporation. 2011. 32000D–04/201. Archived from the original (PDF) on 2017-10-25. Retrieved 2016-10-22.
  5. 1 2 Alpha Architecture Reference Manual (PDF). Compaq. 2002. pp. 4-32, 4-34.
  6. 1 2 Intel 64 and IA-32 Architectures Software Developer Manual. Vol. 2A. Intel. pp. 3-92–3-97. Order number 325383.
  7. AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions (PDF). Vol. 3. Advanced Micro Devices (AMD). 2011. pp. 204–205. Publication No. 24594.
  8. "AMD64 Architecture Programmer's Manual, Volume 3: General-Purpose and System Instructions" (PDF). AMD64 Technology (Version 3.28 ed.). Advanced Micro Devices (AMD). September 2019 [2013]. Publication No. 24594. Archived (PDF) from the original on 2019-09-30. Retrieved 2014-01-02.
  9. Intel Itanium Architecture Software Developer's Manual. Volume 3: Intel Itanium Instruction Set. Vol. 3. Intel. 2010. pp. 3:38. Archived from the original on 2019-06-26.
  10. 1 2 MIPS Architecture For Programmers. Volume II-A: The MIPS32 Instruction Set (Revision 3.02 ed.). MIPS Technologies. 2011. pp. 101–102.
  11. 1 2 MIPS Architecture For Programmers. Volume II-A: The MIPS64 Instruction Set (Revision 3.02 ed.). MIPS Technologies. 2011. pp. 105, 107, 122, 123.
  12. M68000 Family Programmer's Reference Manual (Includes CPU32 Instructions) (PDF) (revision 1 ed.). Motorola. 1992. pp. 4-43–4-45. M68000PRM/AD. Archived from the original (PDF) on 2019-12-08.
  13. Frey, Brad. "Chapter 3.3.11 Fixed-Point Logical Instructions". PowerPC Architecture Book (Version 2.02 ed.). IBM. p. 70.
  14. "Chapter 3.3.13 Fixed-Point Logical Instructions - Chapter 3.3.13.1 64-bit Fixed-Point Logical Instructions". Power ISA Version 3.0B. IBM. pp. 95, 98.
  15. 1 2 Wolf, Clifford (2019-03-22). "RISC-V "B" Bit Manipulation Extension for RISC-V" (PDF). Github (Draft) (v0.37 ed.). Retrieved 2020-01-09.
  16. Oracle SPARC Architecture 2011. Oracle. 2011.
  17. VAX Architecture Reference Manual (PDF). Digital Equipment Corporation (DEC). 1987. pp. 70–71. Archived (PDF) from the original on 2019-09-29. Retrieved 2020-01-09.
  18. 1 2 3 "Chapter 22. Vector Integer Instructions". IBM z/Architecture Principles of Operation (PDF) (Eleventh ed.). IBM. March 2015. pp. 7-219–22-10. SA22-7832-10. Retrieved 2020-01-10.
  19. 1 2 "FFS(3)". Mac OS X Developer Library. Apple, Inc. 1994-04-19. Retrieved 2012-01-04.
  20. "FFS(3)". FreeBSD Library Functions Manual. The FreeBSD Project. Retrieved 2012-01-04.
  21. "Other built-in functions provided by GCC". Using the GNU Compiler Collection (GCC). Free Software Foundation, Inc. Retrieved 2015-11-14.
  22. "GCC 3.4.0 ChangeLog". GCC 3.4.0. Free Software Foundation, Inc. Retrieved 2015-11-14.
  23. "Clang Language Extensions - chapter Builtin Functions". The Clang Team. Retrieved 2017-04-09. Clang supports a number of builtin library functions with the same syntax as GCC
  24. "Source code of Clang". LLVM Team, University of Illinois at Urbana-Champaign. Retrieved 2017-04-09.
  25. "_BitScanForward, _BitScanForward64". Visual Studio 2008: Visual C++: Compiler Intrinsics. Microsoft . Retrieved 2018-05-21.
  26. "_BitScanReverse, _BitScanReverse64". Visual Studio 2008: Visual C++: Compiler Intrinsics. Microsoft . Retrieved 2018-05-21.
  27. "__lzcnt16, __lzcnt, __lzcnt64". Visual Studio 2008: Visual C++: Compiler Intrinsics. Microsoft . Retrieved 2012-01-03.
  28. "ARM intrinsics". Visual Studio 2012: Visual C++: Compiler Intrinsics. Microsoft . Retrieved 2022-05-09.
  29. "Intel Intrinsics Guide". Intel . Retrieved 2020-04-03.
  30. Intel C++ Compiler for Linux Intrinsics Reference. Intel. 2006. p. 21.
  31. NVIDIA CUDA Programming Guide (PDF) (Version 3.0 ed.). NVIDIA. 2010. p. 92.
  32. "'llvm.ctlz.*' Intrinsic, 'llvm.cttz.*' Intrinsic". LLVM Language Reference Manual. The LLVM Compiler Infrastructure. Retrieved 2016-02-23.
  33. Smith, Richard (2020-04-01). N4861 Working Draft, Standard for Programming Language C++ (PDF). ISO/IEC. pp. 1150–1153. Retrieved 2020-05-25.
  34. "Standard library header <bit>". cppreference.com. Retrieved 2020-05-25.
  35. 1 2 SPARC International, Inc. (1992). "A.41: Population Count. Programming Note" (PDF). The SPARC architecture manual: version 8 (Version 8 ed.). Englewood Cliffs, New Jersey, USA: Prentice Hall. pp.  231. ISBN   978-0-13-825001-0.
  36. Warren, Jr., Henry S. (2013) [2002]. Hacker's Delight (2 ed.). Addison Wesley - Pearson Education, Inc. ISBN   978-0-321-84268-8. 0-321-84268-5.
  37. Blackfin Instruction Set Reference (Preliminary ed.). Analog Devices. 2001. pp. 8–24. Part Number 82-000410-14.
  38. Dietz, Henry Gordon. "The Aggregate Magic Algorithms". University of Kentucky. Archived from the original on 2019-10-31.
  39. Isenberg, Gerd (2019-11-03) [2012]. "BitScanProtected". Chess Programming Wiki (CPW). Archived from the original on 2020-01-09. Retrieved 2020-01-09.
  40. Anderson. Round up to the next highest power of 2.
  41. 1 2 3 4 Warren. Chapter 5-3: Counting Leading 0's.
  42. 1 2 3 Warren. Chapter 5-4: Counting Trailing 0's.
  43. Leiserson, Charles E.; Prokop, Harald; Randall, Keith H. (1998-07-07). "Using de Bruijn Sequences to Index a 1 in a Computer Word" (PDF). MIT Laboratory for Computer Science, Cambridge, MA, USA. Archived (PDF) from the original on 2020-01-09. Retrieved 2020-01-09.
  44. Busch, Philip (2009-03-01) [2009-02-21]. "Computing Trailing Zeros HOWTO" (PDF). Archived (PDF) from the original on 2016-08-01. Retrieved 2020-01-09.
  45. Anderson. Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup.
  46. Anderson. Find the integer log base 2 of an integer with a 64-bit IEEE float.
  47. Sloss, Andrew N.; Symes, Dominic; Wright, Chris (2004). ARM system developer's guide designing and optimizing system software (1 ed.). San Francisco, CA, USA: Morgan Kaufmann. pp. 212–213. ISBN   978-1-55860-874-0.
  48. Warren. Chapter 2-11: Comparison Predicates.
  49. Warren. Chapter 6-2: Find First String of 1-Bits of a Given Length.
  50. Warren. Chapter 11-1: Integer Square Root.
  51. Schlegel, Benjamin; Gemulla, Rainer; Lehner, Wolfgang [in German] (June 2010). "Fast integer compression using SIMD instructions". Proceedings of the Sixth International Workshop on Data Management on New Hardware. pp. 34–40. CiteSeerX   10.1.1.230.6379 . doi:10.1145/1869389.1869394. ISBN   978-1-45030189-3. S2CID   7545142.
  52. Warren. Chapter 2-12: Overflow Detection.
  53. Gosper, Bill (April 1995) [1972-02-29]. Baker, Henry Givens Jr. (ed.). "Loop detector". HAKMEM (retyped & converted ed.). Cambridge, Massachusetts, USA: Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT). AI Memo 239 Item 132. Archived from the original on 2019-10-08. Retrieved 2020-01-09.
  54. Aas, Josh (2005-02-17). Understanding the Linux 2.6.8.1 CPU Scheduler (PDF). Silicon Graphics, Inc. (SGI). p. 19. Archived (PDF) from the original on 2017-05-19. Retrieved 2020-01-09.

Further reading