Fast inverse square root, sometimes referred to as Fast InvSqrt() or by the hexadecimal constant 0x5F3759DF, is an algorithm that estimates $\frac{1}{\sqrt{x}}$, the reciprocal (or multiplicative inverse) of the square root of a 32-bit floating-point number $x$ in IEEE 754 floating-point format. The algorithm is best known for its implementation in 1999 in Quake III Arena, a first-person shooter video game heavily based on 3D graphics. With subsequent hardware advancements, especially the x86 SSE instruction rsqrtss, this algorithm is not generally the best choice for modern computers, [1] though it remains an interesting historical example. [2]
The algorithm accepts a 32-bit floating-point number as the input and stores a halved value for later use. Then, treating the bits representing the floating-point number as a 32-bit integer, a logical shift right by one bit is performed and the result subtracted from the number 0x5F3759DF, which is a floating-point representation of an approximation of $\sqrt{2^{127}}$. [3] This results in the first approximation of the inverse square root of the input. Treating the bits again as a floating-point number, it runs one iteration of Newton's method, yielding a more precise approximation.
William Kahan and K.C. Ng at Berkeley wrote an unpublished paper in May 1986 describing how to calculate the square root using bit-fiddling techniques followed by Newton iterations. [4] In the late 1980s, Cleve Moler at Ardent Computer learned about this technique [5] and passed it along to his coworker Greg Walsh. Greg Walsh devised the now-famous constant and fast inverse square root algorithm. Gary Tarolli was consulting for Kubota, the company funding Ardent at the time, and likely brought the algorithm to 3dfx Interactive circa 1994. [6] [7]
Jim Blinn demonstrated a simple approximation of the inverse square root in a 1997 column for IEEE Computer Graphics and Applications . [8] Reverse engineering of other contemporary 3D video games uncovered a variation of the algorithm in Activision's 1997 Interstate '76. [9]
Quake III Arena, a first-person shooter video game, was released in 1999 by id Software and used the algorithm. Brian Hook may have brought the algorithm from 3dfx to id Software. [6] A discussion of the code appeared on the Chinese developer forum CSDN in 2000, [10] and Usenet and the gamedev.net forum spread the code widely in 2002 and 2003. [11] Speculation arose as to who wrote the algorithm and how the constant was derived; some guessed John Carmack. [7] Quake III's full source code was released at QuakeCon 2005, but provided no answers. The authorship question was resolved in 2006 when Greg Walsh, the original author, contacted Beyond3D after their speculation gained popularity on Slashdot. [6]
In 2007 the algorithm was implemented in some dedicated hardware vertex shaders using field-programmable gate arrays (FPGA). [12] [13]
The inverse square root of a floating point number is used in digital signal processing to normalize a vector, scaling it to length 1 to produce a unit vector. [14] For example, computer graphics programs use inverse square roots to compute angles of incidence and reflection for lighting and shading. 3D graphics programs must perform millions of these calculations every second to simulate lighting. When the code was developed in the early 1990s, floating-point processing generally lagged behind integer processing in speed. [7] This was troublesome for 3D graphics programs before the advent of specialized hardware to handle transform and lighting. Computation of square roots usually depends upon many division operations, which for floating point numbers are computationally expensive. The fast inverse square root generates a good approximation without performing any division.
The length of the vector is determined by calculating its Euclidean norm: the square root of the sum of squares of the vector components. When each component of the vector is divided by that length, the new vector will be a unit vector pointing in the same direction. In a 3D graphics program, all vectors are in three-dimensional space, so $\boldsymbol{v}$ would be a vector $(v_1, v_2, v_3)$.

$\|\boldsymbol{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2}$

is the Euclidean norm of the vector.

$\boldsymbol{\hat{v}} = \boldsymbol{v} / \|\boldsymbol{v}\|$

is the normalized (unit) vector. Substituting $x = \|\boldsymbol{v}\|^2$, the equation can also be written as:

$\boldsymbol{\hat{v}} = \boldsymbol{v} \frac{1}{\sqrt{x}}$

which relates the unit vector to the inverse square root of the distance components. The inverse square root can be used to compute $\boldsymbol{\hat{v}}$ because this equation is equivalent to

$\boldsymbol{\hat{v}} = \frac{\boldsymbol{v}}{\sqrt{v_1^2 + v_2^2 + v_3^2}}$

where the fraction term is the inverse square root of $x$.
At the time, floating-point division was generally expensive compared to multiplication; the fast inverse square root algorithm bypassed the division step, giving it its performance advantage.
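To make this concrete, here is a minimal C sketch of the normalization step; rsqrt() and normalize3() are illustrative names, with rsqrt() standing in for whichever inverse-square-root routine is chosen, exact or approximate:

#include <math.h>

/* Placeholder inverse square root; an approximation such as the
   Q_rsqrt() shown below could be substituted here. */
static float rsqrt(float x) { return 1.0f / sqrtf(x); }

/* Scale a 3D vector to unit length using one inverse square root
   and three multiplications, with no per-component division. */
void normalize3(float v[3])
{
    float s = rsqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
    v[0] *= s;
    v[1] *= s;
    v[2] *= s;
}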
The following C code is the fast inverse square root implementation from Quake III Arena, stripped of C preprocessor directives, but including the exact original comment text: [15]

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
At the time, the general method to compute the inverse square root was to calculate an approximation for $\frac{1}{\sqrt{x}}$, then revise that approximation via another method until it came within an acceptable error range of the actual result. Common software methods in the early 1990s drew approximations from a lookup table. [16] The key to the fast inverse square root was to directly compute an approximation by utilizing the structure of floating-point numbers, proving faster than table lookups. The algorithm was approximately four times faster than computing the square root with another method and calculating the reciprocal via floating-point division. [17] The algorithm was designed with the IEEE 754-1985 32-bit floating-point specification in mind, but investigation from Chris Lomont showed that it could be implemented in other floating-point specifications. [18]
The advantages in speed offered by the fast inverse square root trick came from treating the 32-bit floating-point word [note 1] as an integer, then subtracting it from a "magic" constant, 0x5F3759DF. [7] [19] [20] [21] This integer subtraction and bit shift results in a bit pattern which, when re-defined as a floating-point number, is a rough approximation for the inverse square root of the number. One iteration of Newton's method is performed to gain some accuracy, and the code is finished. The algorithm generates reasonably accurate results using a unique first approximation for Newton's method; however, it is much slower and less accurate than using the SSE instruction rsqrtss on x86 processors, also released in 1999. [1] [22]
As an example, the number $x = 0.15625$ can be used to calculate $\frac{1}{\sqrt{x}} \approx 2.52982$. The first steps of the algorithm are illustrated below:

0011_1110_0010_0000_0000_0000_0000_0000  Bit pattern of both x and i
0001_1111_0001_0000_0000_0000_0000_0000  Shift right one position: (i >> 1)
0101_1111_0011_0111_0101_1001_1101_1111  The magic number 0x5F3759DF
0100_0000_0010_0111_0101_1001_1101_1111  The result of 0x5F3759DF - (i >> 1)
Interpreting these bit patterns as IEEE 754 32-bit representations:

0_01111100_01000000000000000000000  1.25 × 2^−3
0_00111110_00100000000000000000000  1.125 × 2^−65
0_10111110_01101110101100111011111  1.432430... × 2^63
0_10000000_01001110101100111011111  1.307430... × 2^1
Reinterpreting this last bit pattern as a floating point number gives the approximation $y = 2.61486$, which has an error of about 3.4%. After one iteration of Newton's method, the final result is $y = 2.52549$, an error of only 0.17%.
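These steps are easy to verify; the following sketch (not from the original source; it uses memcpy for the bit reinterpretation to stay portable) reproduces the numbers above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float x = 0.15625f;
    uint32_t i;
    memcpy(&i, &x, sizeof i);     /* bit pattern 0x3E200000 */
    i = 0x5F3759DF - (i >> 1);    /* result 0x402759DF */
    float y;
    memcpy(&y, &i, sizeof y);
    printf("first approximation:  %f\n", y);  /* about 2.614861 */
    printf("after one Newton step: %f\n",
           y * (1.5f - 0.5f * x * y * y));    /* about 2.525490 */
    return 0;
}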
According to the C standard, reinterpreting a floating point value as an integer by casting a pointer to it and then dereferencing that pointer is not valid (undefined behavior). Another way would be to place the floating point value in an anonymous union containing an additional 32-bit unsigned integer member; accessing that integer provides a bit-level view of the contents of the floating point value.
#include <stdint.h> // uint32_t

float Q_rsqrt(float number)
{
    union {
        float    f;
        uint32_t i;
    } conv = { .f = number };
    conv.i  = 0x5f3759df - (conv.i >> 1);
    conv.f *= 1.5F - (number * 0.5F * conv.f * conv.f);
    return conv.f;
}
For C++, however, type punning through a union is also undefined behavior. In modern C++, the recommended method for implementing this function's casts is through C++20's std::bit_cast. This also allows the function to work in constexpr context:
import std;

constexpr std::float32_t Q_rsqrt(std::float32_t number) noexcept
{
    const auto y = std::bit_cast<std::float32_t>(
        0x5f3759df - (std::bit_cast<std::uint32_t>(number) >> 1));
    return y * (1.5f32 - (number * 0.5f32 * y * y));
}
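Before C++20, a common standards-conforming approach in both C and C++ is to copy the bits with memcpy, which modern compilers typically reduce to a plain register move; a sketch:

#include <stdint.h>
#include <string.h>

float Q_rsqrt(float number)
{
    uint32_t i;
    float y;
    memcpy(&i, &number, sizeof i);   /* copy the bits; no pointer aliasing */
    i = 0x5f3759df - (i >> 1);
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - (number * 0.5f * y * y));
}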
The algorithm computes $\frac{1}{\sqrt{x}}$ by performing the following steps:

1. Alias the argument $x$ to an integer as a way to compute an approximation of the binary logarithm $\log_2(x)$.
2. Use this approximation to compute an approximation of $\log_2\left(\frac{1}{\sqrt{x}}\right) = -\frac{1}{2}\log_2(x)$.
3. Alias back to a float, as a way to compute an approximation of the base-2 exponential.
4. Refine the approximation using a single iteration of Newton's method.
Since this algorithm relies heavily on the bit-level representation of single-precision floating-point numbers, a short overview of this representation is provided here. To encode a non-zero real number $x$ as a single precision float, the first step is to write $x$ as a normalized binary number: [23]

$x = \pm 1.b_1 b_2 b_3 \ldots \times 2^{e_x}$

where the exponent $e_x$ is an integer, and $1.b_1 b_2 b_3 \ldots$ is the binary representation of the significand. Since the single bit before the point in the significand is always 1, it need not be stored. The equation can be rewritten as:

$x = \pm 2^{e_x} (1 + m_x)$

where $m_x$ means $0.b_1 b_2 b_3 \ldots$, so $m_x \in [0, 1)$. From this form, three unsigned integers are computed: [24]

$S_x$, the sign bit: $0$ if $x > 0$, $1$ if $x < 0$
$E_x = e_x + B$, the biased exponent, where $B = 127$ is the exponent bias
$M_x = m_x \times L$, the mantissa field, where $L = 2^{23}$
These fields are then packed, left to right, into a 32-bit container. [25]
As an example, consider again the number $x = 0.15625$. Normalizing yields:

$x = +2^{-3} \times 1.01_2 = +2^{-3}(1 + 0.25)$

and thus, the three unsigned integer fields are:

$S = 0$
$E = -3 + 127 = 124 = 0111\,1100_2$
$M = 0.25 \times 2^{23} = 2\,097\,152 = 010\,0000\,0000\,0000\,0000\,0000_2$

Packed left to right, these fields give the bit pattern 0_01111100_01000000000000000000000, matching the representation shown earlier.
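A short sketch (illustrative, not part of the original article's code) extracts the three fields and confirms these values:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float x = 0.15625f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    uint32_t S = bits >> 31;           /* sign bit: 0          */
    uint32_t E = (bits >> 23) & 0xFFu; /* biased exponent: 124 */
    uint32_t M = bits & 0x7FFFFFu;     /* mantissa: 2097152    */

    printf("S=%u E=%u M=%u\n", S, E, M);
    return 0;
}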
If $\frac{1}{\sqrt{x}}$ were to be calculated without a computer or a calculator, a table of logarithms would be useful, together with the identity $\log_b\left(\frac{1}{\sqrt{x}}\right) = -\frac{1}{2}\log_b(x)$, which is valid for every base $b$. The fast inverse square root is based on this identity, and on the fact that aliasing a float32 to an integer gives a rough approximation of its logarithm. Here is how:

If $x$ is a positive normal number:

$x = 2^{e_x}(1 + m_x)$

then

$\log_2(x) = e_x + \log_2(1 + m_x)$

and since $m_x \in [0, 1)$, the logarithm on the right-hand side can be approximated by [26]

$\log_2(1 + m_x) \approx m_x + \sigma$

where $\sigma$ is a free parameter used to tune the approximation. For example, $\sigma = 0$ yields exact results at both ends of the interval, while $\sigma \approx 0.0430357$ yields the optimal approximation (the best in the sense of the uniform norm of the error). However, this value is not used by the algorithm as it does not take subsequent steps into account.

Thus there is the approximation

$\log_2(x) \approx e_x + m_x + \sigma$

Interpreting the floating-point bit-pattern of $x$ as an integer yields [note 4]

$I_x = E_x L + M_x = L(e_x + B + m_x) = L(e_x + m_x + \sigma) + L(B - \sigma) \approx L\log_2(x) + L(B - \sigma)$

It then appears that $I_x$ is a scaled and shifted piecewise-linear approximation of $\log_2(x)$, as illustrated in the figure on the right. In other words, $\log_2(x)$ is approximated by

$\log_2(x) \approx \frac{I_x}{L} - (B - \sigma)$
The calculation of $y = \frac{1}{\sqrt{x}}$ is based on the identity

$\log_2(y) = -\frac{1}{2}\log_2(x)$

Using the approximation of the logarithm above, applied to both $x$ and $y$, the above equation gives:

$\frac{I_y}{L} - (B - \sigma) \approx -\frac{1}{2}\left(\frac{I_x}{L} - (B - \sigma)\right)$

Thus, an approximation of $I_y$ is:

$I_y \approx \frac{3}{2}L(B - \sigma) - \frac{1}{2}I_x$

which is written in the code as

i = 0x5f3759df - (i >> 1);

The first term above is the magic number

$\frac{3}{2}L(B - \sigma) = \text{0x5F3759DF}$

from which it can be inferred that $\sigma \approx 0.0450466$. The second term, $\frac{1}{2}I_x$, is calculated by shifting the bits of $I_x$ one position to the right. [27]
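Both relations can be checked numerically; the following sketch compares the aliasing-based estimate of $\log_2(x)$ with the true value and recovers $\sigma$ from the magic constant:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const double L = 8388608.0;  /* 2^23 */
    const double B = 127.0;

    /* I_x / L - B approximates log2(x); it omits sigma, so it
       underestimates by at most about 0.086. */
    for (float x = 0.5f; x < 20.0f; x *= 1.9f) {
        uint32_t i;
        memcpy(&i, &x, sizeof i);
        printf("x=%8.5f  estimate=%8.5f  log2=%8.5f\n",
               x, i / L - B, log2(x));
    }

    /* The magic constant encodes (3/2) L (B - sigma). */
    printf("sigma = %.7f\n", B - 0x5F3759DF / (1.5 * L)); /* about 0.0450466 */
    return 0;
}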
With $y$ as the inverse square root, $f(y) = \frac{1}{y^2} - x = 0$. The approximation yielded by the earlier steps can be refined by using a root-finding method, a method that finds the zero of a function. The algorithm uses Newton's method: if there is an approximation $y_n$ for $y$, then a better approximation $y_{n+1}$ can be calculated by taking $y_n - \frac{f(y_n)}{f'(y_n)}$, where $f'(y_n)$ is the derivative of $f(y)$ at $y_n$. [28] For the above $f(y)$,

$y_{n+1} = \frac{y_n(3 - xy_n^2)}{2}$

where $f(y) = \frac{1}{y^2} - x$ and $f'(y) = -\frac{2}{y^3}$.

Treating $y$ as a floating-point number,

y = y * (threehalfs - x2 * y * y);

is equivalent to

$y_{n+1} = y_n\left(\frac{3}{2} - \frac{x}{2}y_n^2\right)$
By repeating this step, using the output of the function ($y_{n+1}$) as the input of the next iteration, the algorithm causes $y$ to converge to the inverse square root. [29] For the purposes of the Quake III engine, only one iteration was used. A second iteration remained in the code but was commented out. [21]
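The convergence can be observed directly; a sketch of the repeated step (the original code unrolls exactly one iteration):

#include <stdio.h>

int main(void)
{
    const float x  = 0.15625f;         /* true 1/sqrt(x) = 2.5298221... */
    const float x2 = x * 0.5f;
    float y = 2.614861f;               /* first approximation from the bit trick */
    for (int n = 1; n <= 4; n++) {
        y = y * (1.5f - x2 * y * y);   /* Newton step */
        printf("iteration %d: y = %.7f\n", n, y);
    }
    return 0;
}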
As noted above, the approximation is very accurate. The graph on the right plots the error of the function (that is, the error of the approximation after it has been improved by running one iteration of Newton's method) for inputs starting at 0.01. At 0.01, the standard library gives 10.0 as a result, while InvSqrt() gives 9.982522, making the relative difference 0.0017478, or 0.175% of the true value, 10. The absolute error only drops from then on, and the relative error stays within the same bounds across all orders of magnitude.
It is not known precisely how the exact value for the magic number was determined. Chris Lomont developed a function to minimize approximation error by choosing the magic number over a range. He first computed the optimal constant for the linear approximation step as 0x5F37642F, close to 0x5F3759DF, but this new constant gave slightly less accuracy after one iteration of Newton's method. [30] Lomont then searched for a constant optimal even after one and two Newton iterations and found 0x5F375A86, which is more accurate than the original at every iteration stage. [30] He concluded by asking whether the exact value of the original constant was chosen through derivation or trial and error. [31] Lomont said that the magic number for the 64-bit IEEE 754 double type is 0x5FE6EC85E7DE30DA, but it was later shown by Matthew Robertson to be exactly 0x5FE6EB50C7B537A9. [32]
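The effect of swapping constants is easy to measure; the following sketch (illustrative; approx() is a made-up helper) estimates the worst relative error of each constant after one Newton iteration:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float approx(float x, uint32_t magic)
{
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);
    i = magic - (i >> 1);
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - (x * 0.5f * y * y));   /* one Newton iteration */
}

int main(void)
{
    uint32_t magics[] = { 0x5F3759DF, 0x5F375A86 };
    for (int m = 0; m < 2; m++) {
        double worst = 0.0;
        /* The error pattern repeats every factor of 4 in x (the exponent
           drops by 2 and y exactly halves), so scanning [1, 4) suffices. */
        for (double x = 1.0; x < 4.0; x += 1e-5) {
            double err = fabs(approx((float)x, magics[m]) * sqrt(x) - 1.0);
            if (err > worst) worst = err;
        }
        printf("0x%08X: max relative error %.6f\n", magics[m], worst);
    }
    return 0;   /* the second constant's worst error is slightly smaller */
}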
Jan Kadlec reduced the relative error by a further factor of 2.7 by adjusting the constants in the single Newton's method iteration as well, [33] arriving after an exhaustive search at
conv.i = 0x5F1FFFF9 - (conv.i >> 1);
conv.f *= 0.703952253f * (2.38924456f - x * conv.f * conv.f);
return conv.f;
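Read in full, the fragment drops into the union-based routine shown earlier; a self-contained version might look like this (the function name is illustrative):

#include <stdint.h>

float rsqrt_kadlec(float x)
{
    union { float f; uint32_t i; } conv = { .f = x };
    conv.i = 0x5F1FFFF9 - (conv.i >> 1);
    conv.f *= 0.703952253f * (2.38924456f - x * conv.f * conv.f);
    return conv.f;
}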
A complete mathematical analysis for determining the magic number is now available for single-precision floating-point numbers. [34] [35]
Intermediate to the use of one vs. two iterations of Newton's method in terms of speed and accuracy is a single iteration of Halley's method. In this case, Halley's method is equivalent to applying Newton's method to the function $f(y)/\sqrt{|f'(y)|}$. The update step is then

$y_{n+1} = y_n \, \frac{3 + x y_n^2}{1 + 3 x y_n^2}$

where the implementation should calculate $x y_n^2$ only once, via a temporary variable.
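A sketch of this variant (illustrative; note that the rational update reintroduces one floating-point division, unlike the Newton step):

#include <stdint.h>

float rsqrt_halley(float x)
{
    union { float f; uint32_t i; } conv = { .f = x };
    conv.i = 0x5f3759df - (conv.i >> 1);  /* same first approximation */
    float y = conv.f;
    float t = x * y * y;                  /* calculated once, used twice */
    return y * (3.0f + t) / (1.0f + 3.0f * t);
}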
Subsequent additions by hardware manufacturers have made this algorithm redundant for the most part. For example, on x86, Intel introduced the SSE instruction rsqrtss in 1999. In a 2009 benchmark on the Intel Core 2, this instruction took 0.85 ns per float compared to 3.54 ns for the fast inverse square root algorithm, and had less error. [1]
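For comparison, the hardware path is a single instruction, reachable from C through the SSE intrinsic _mm_rsqrt_ss; Intel documents a relative error of at most 1.5 × 2^−12 for rsqrtss. A sketch:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics, x86 only */

int main(void)
{
    float x = 0.15625f, y;
    _mm_store_ss(&y, _mm_rsqrt_ss(_mm_set_ss(x)));  /* y ~= 1/sqrt(x) */
    printf("rsqrtss: %f (true value 2.5298221)\n", y);
    return 0;
}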
Some low-cost embedded systems do not have specialized square root instructions. However, manufacturers of these systems usually provide trigonometric and other math libraries, based on algorithms such as CORDIC.
Note 1: Use of the type long reduces the portability of this code on modern systems. For the code to execute properly, sizeof(long) must be 4 bytes, otherwise negative outputs may result. Under many modern 64-bit systems, sizeof(long) is 8 bytes. The more portable replacement is int32_t.