Scale factor (computer science)

In computer science, a scale factor is a number used as a multiplier to represent a number on a different scale, functioning similarly to an exponent in mathematics. A scale factor is used when a real-world set of numbers needs to be represented on a different scale in order to fit a specific number format. Although using a scale factor extends the range of representable values, it also decreases the precision, resulting in rounding error for certain calculations.

Uses

Certain number formats may be chosen for an application for convenience in programming, or because of advantages offered by the hardware for that format. For instance, early processors did not natively support the IEEE floating-point standard for representing fractional values, so integers were used to store representations of real world values by applying a scale factor to the real value; by necessity, this was done in software, since the hardware did not support fractional values. Similarly, because hardware arithmetic has a fixed width (commonly 16, 32, or 64 bits, depending on the data type), scale factors allow representation of larger numbers (by manually multiplying or dividing by the specified scale factor), though at the expense of precision. [1] Scale factors are also used in floating-point numbers, where they are most commonly powers of two. For example, the double-precision format sets aside 11 bits for the scaling factor (a binary exponent) and 53 bits for the significand, allowing various degrees of precision for representing different ranges of numbers and expanding the range of representable numbers beyond what could be represented using 64 explicit bits (though at the cost of precision). [2]
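
As a concrete illustration (a minimal C sketch, not drawn from the cited sources), the standard C library function frexp exposes this decomposition directly, returning a significand together with the binary exponent that acts as the value's power-of-two scale factor:

    #include <math.h>
    #include <stdio.h>

    /* Decompose a double into a significand and a binary exponent,
       illustrating the implicit power-of-two scale factor applied to
       every floating-point value. */
    int main(void)
    {
        double value = 6000000.0;
        int exponent;                        /* binary exponent (the scale) */
        double significand = frexp(value, &exponent);

        /* significand is in [0.5, 1) and value == significand * 2^exponent */
        printf("%g = %.17g * 2^%d\n", value, significand, exponent);
        return 0;
    }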

As an example of where precision is lost, a 16-bit unsigned integer (uint16) can only hold a value as large as 65,535₁₀. If unsigned 16-bit integers are used to represent values from 0 to 131,070₁₀, then a scale factor of 1/2 would be introduced, such that the scaled values correspond exactly to the real-world even integers. As a consequence, for example, the number 3 cannot be represented, because a stored 1 represents a real-world 2 and a stored 2 represents a real-world 4; there are not enough bits available to avoid this error in this representation.
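
A minimal C sketch of this scheme (the helper names encode and decode are illustrative, not part of any standard API) shows even values surviving the round trip while the real-world 3 comes back as 2:

    #include <stdint.h>
    #include <stdio.h>

    /* Real-world values 0..131070 stored in a 16-bit unsigned integer
       with a scale factor of 1/2; odd values cannot be represented. */
    static uint16_t encode(uint32_t real)   { return (uint16_t)(real / 2u); }
    static uint32_t decode(uint16_t stored) { return (uint32_t)stored * 2u; }

    int main(void)
    {
        printf("%lu\n", (unsigned long)decode(encode(131070u))); /* 131070: exact   */
        printf("%lu\n", (unsigned long)decode(encode(3u)));      /* 2: the 3 is lost */
        return 0;
    }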

Operations on scaled values

Once the scaled representation of a real value is stored, the scaling can often be ignored until the value needs to come back into the "real world". For instance, adding two scaled values is just as valid as unscaling the values, adding the real values, and then scaling the result, and the former is much easier and faster. For either approach, though, the two added numbers must have the same scale factor. [3] For other operations, however, the scaling cannot be ignored.
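
The following C sketch (with an arbitrary, illustrative scale factor Z = 2) shows that adding the stored values gives the same result as adding the real values and then scaling:

    #include <stdio.h>

    /* Adding two values that share the same scale factor Z: the stored sum
       equals the scaled real-world sum, so no rescaling step is needed. */
    int main(void)
    {
        const int Z = 2;                /* shared scale factor (illustrative) */
        int a = 10, b = 14;             /* real-world values                  */
        int az = a * Z, bz = b * Z;     /* scaled (stored) representations    */

        int sum_scaled = az + bz;       /* arithmetic on the stored values    */
        int sum_real   = (a + b) * Z;   /* scale applied after adding         */

        printf("%d %d\n", sum_scaled, sum_real);  /* both print 48 */
        return 0;
    }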

Multiplication, for instance, needs to take into account that both numbers are scaled. As an example, consider two real world values A and B. Their real world product is:

A * B = P 

If they are instead represented with a scale factor of Z, and these scaled representations are subsequently multiplied, the result is the following:

AZ * BZ = Q 

AZ is the scaled representation of A, or simply the product A * Z, and likewise, BZ is the scaled representation of B. After the scaled multiplication, the result should not be written as PZ, because the product of the two scaled values is not the scaled representation of P. This can be seen by rearranging the statement, where each line in the following is equivalent:

AZ * BZ = Q
A * Z * B * Z = Q
(A * B) * Z * Z = Q
P * Z * Z = Q
PZ * Z = Q

In line 4, P is substituted for A * B. It follows that the result of AZ * BZ (which is Q) is not PZ, but rather PZ * Z. If PZ were the answer, it could be stored directly, since it has the scale factor built in, as is the case with addition and subtraction. For multiplication, however, the product of two scaled values carries an extra factor of the scaling. As long as this is taken into account, there is still no need to convert AZ and BZ into A and B before performing the operation; the result simply must be divided by Z before being stored back. After this, PZ will be stored as the result of the multiplication, which is indeed the scaled representation of the result of A * B (the desired answer) rather than the raw result of AZ * BZ (which still carries the extra factor of Z).
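
A short C sketch of this rule (the values and the scale factor Z = 4 are arbitrary) divides the raw product by Z once before storing it:

    #include <stdio.h>

    /* Multiplying two scaled values: the raw product carries an extra
       factor of Z and must be divided by Z once before being stored. */
    int main(void)
    {
        const int Z = 4;                 /* illustrative scale factor */
        int a = 6, b = 7;                /* real-world values         */
        int az = a * Z, bz = b * Z;      /* scaled representations    */

        int q  = az * bz;                /* = (a*b) * Z * Z, scaled twice     */
        int pz = q / Z;                  /* = (a*b) * Z, correctly scaled     */

        printf("%d %d\n", pz, (a * b) * Z);  /* both print 168 */
        return 0;
    }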

Common scaling scenarios

Fractional values scaled to integers

As previously described, many older processors (and possibly some current ones) do not natively support fractional arithmetic. In this case, fractional values can be scaled into integers by multiplying them by ten to the power of whatever decimal precision is desired. In other words, to preserve n digits to the right of the decimal point, it is necessary to multiply the entire number by 10ⁿ. In computers, which perform calculations in binary, the real number is multiplied by 2ᵐ to preserve m digits to the right of the binary point; alternatively, one can bit shift the value m places to the left. For example, in the following set of real world fractional values, all have three digits to the right of the decimal point:

15.400, 0.133, 4.650, 1.000, 8.001 

To save all of that information (in other words, not lose any precision), these numbers must be multiplied by 10³ (1,000), giving integer values of:

15400, 133, 4650, 1000, 8001 

Because of the magnitude of the scaled numbers, they cannot be stored in 8-bit integers; they require at least 14 unsigned bits (the largest value, 15400, exceeds 2¹³ = 8192), or, more realistically, 16.
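
A C sketch of this scaling step (illustrative only; the added 0.5 merely rounds away the tiny error from decimal fractions not being exactly representable in binary floating point):

    #include <stdint.h>
    #include <stdio.h>

    /* Preserve three decimal digits by multiplying each value by 10^3
       and storing the result in a 16-bit unsigned integer. */
    int main(void)
    {
        const double values[] = { 15.400, 0.133, 4.650, 1.000, 8.001 };
        const int    scale    = 1000;                  /* 10^3 */

        for (int i = 0; i < 5; i++) {
            uint16_t stored = (uint16_t)(values[i] * scale + 0.5); /* round */
            printf("%u\n", (unsigned)stored);  /* 15400 133 4650 1000 8001 */
        }
        return 0;
    }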

Integer values to fractions

Certain processors, particularly digital signal processors (DSPs) common in the embedded systems industry, have built-in support for fixed-point arithmetic, such as the Q and IQ formats.

Since the fractional part of a number takes up some bits in the field, the range of values possible in a fixed-point value is less than the same number of bits would provide to an integer. [4] For instance, in an 8-bit field, an unsigned integer can store values in the range [0, 255], but an unsigned fixed-point value with 5 bits allocated to the fractional part has only 3 bits left for the integer part, and so can only store values whose integer part is in [0, 7]. (The number of distinct values that the two fields can store is the same, 2⁸ = 256, because the fixed-point field can also store 2⁵ = 32 fractional values for each integer value.) It is therefore common for a scale factor to be used to store real world values that may be larger than the maximum value of the fixed-point format.

As an example, when using an unsigned 8-bit fixed-point format (which has 4 integer bits and 4 fractional bits), the highest representable integer value is 15, and the highest representable mixed value is 15.9375 (0xF.F or 1111.1111b). If the desired real world values are in the range [0, 160], they must be scaled to fit within this fixed-point representation. A scale factor of 1/10 cannot be used here, because scaling 160 by 1/10 gives 16, which is greater than the greatest value that can be stored in this fixed-point format. However, 1/11 will work as a scale factor, because the maximum scaled value, 160/11 ≈ 14.54, fits within this range. Given this set:

154, 101, 54, 3, 0, 160 

Scaling these with the scale factor 1/11 gives the following values:

154/11 = 14
101/11 = 9.1818...
54/11 = 4.9090...
3/11 = 0.2727...
0/11 = 0
160/11 = 14.5454...

Many of these values have been truncated because they contain repeating decimals, a consequence of the chosen scale factor (elevenths do not terminate in decimal). When storing these in our fixed-point format, some precision will be lost (in contrast to the exact values of the original integers). This loss is avoidable in principle: an 8-bit format can distinguish 256 different values, while the numbers in this set span a range with only 161 possible values (0 through 160). As it turns out, the problem was the scale factor, 1/11, which introduced unnecessary precision requirements and rounding error (when approximating a real value with the nearest representable value). [5] To avoid or resolve this problem, one must choose a better scale factor.

Choosing a scale factor

The example above illustrates how certain scale factors can cause unnecessary precision loss or rounding error, highlighting the importance of choosing the right scale factor. Using the scale factor of 1/11 and converting to binary representations, the following values are obtained:

154/11 = 14 = 1110.0
101/11 = 9.1818... = 1001.00101110...
54/11 = 4.9090... = 100.111010...
3/11 = 0.2727... = 0.010001...
0/11 = 0 = 0.0
160/11 = 14.5454... = 1110.100010...

Several of these binary fractions require more than the four fractional bits provided by the chosen fixed-point format. (This is in part because elevenths do not terminate in binary either.) To fit them into the fields (4 integer and 4 fractional bits), it is possible to truncate the remaining bits, giving the following stored representations:

1110.0000
1001.0010
0100.1110
0000.0100
0000.0000
1110.1000

Or in decimal:

14.0
9.125
4.875
0.25
0.0
14.5

When they are called back into the real world, they are divided by the scale factor, 1/11 (that is, multiplied by 11). This is the inverse of the original scaling, giving the following "real world" values:

154.0
100.375
53.625
2.75
0
159.5

These values are not equivalent to the originals (before scaling down and fitting into this 8-bit representation). Most noticeably, they are not all integers anymore, immediately indicating that an error was introduced in the storage, due to a poor choice of scaling factor.
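
The whole round trip can be reproduced with a short C sketch (illustrative only): scale by 1/11, truncate to four fractional bits, store, and unscale. It prints the same "real world" values listed above:

    #include <stdio.h>

    /* Round trip of the example data set with the poor scale factor 1/11:
       scale, truncate to 4 fractional bits, then unscale. */
    int main(void)
    {
        const int data[] = { 154, 101, 54, 3, 0, 160 };

        for (int i = 0; i < 6; i++) {
            double scaled    = data[i] / 11.0;           /* apply scale factor 1/11       */
            int    stored    = (int)(scaled * 16.0);     /* truncate to 4 fractional bits */
            double recovered = (stored / 16.0) * 11.0;   /* unscale back to "real world"  */
            printf("%d -> %g\n", data[i], recovered);
        }
        return 0;
    }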

Choosing a better scale factor

Most data sets will not have a perfect scale factor; most likely, there will be some error introduced by the scaling process. However, it may be possible to choose a better scale factor. The ideal scale factor is not necessarily the one that scales the data down the least (making fullest use of the format's range), but rather one that preserves as much precision as possible.

Dividing a binary number by a power of two is the same as shifting its bits to the right, one place for each factor of two. (This is the binary equivalent of shifting decimal digits relative to the decimal point when multiplying or dividing by powers of ten.) The pattern of bits does not change; it simply moves by the number of places given by the binary exponent (for instance, 3 places to the right when dividing by 8 = 2³). On the other hand, dividing by a number that is not an integer power of two changes the bit pattern, and is likely to produce a pattern with more bits to the right of the binary point, artificially increasing the precision required to store the value. This is especially true when the result's fractional part has a denominator (in lowest terms) that is not a power of two, as all such fractions recur rather than terminate in binary. [6] Therefore, it is almost always preferable to use a scale factor that is a power of two. Bits may still be lost when they are shifted off the right end of the field and truncated, but no new bits that would be imprecise (due to rounding error) or truncated are introduced. [6]
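
A small C sketch (with illustrative values) shows that shifting right by three places and dividing by 8 = 2³ give the same result, with the surviving bits unchanged:

    #include <stdio.h>

    /* Dividing by a power of two is a pure shift: the bits that remain are
       unchanged, they just move past the (implied) binary point. */
    int main(void)
    {
        unsigned int x = 0x9A;               /* 154 = 1001 1010 in binary    */

        unsigned int shifted = x >> 3;       /* 0001 0011 = 19               */
        unsigned int divided = x / 8;        /* dividing by 8 = 2^3: also 19 */

        printf("%u %u\n", shifted, divided); /* 19 19 */
        return 0;
    }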

As an illustration of the use of powers of two in the scale factor, a scale factor of 1/16 can be applied to the above data set. The binary values of the original data set are given below:

154 = 1001 1010
101 = 0110 0101
 54 = 0011 0110
  3 = 0000 0011
  0 = 0000 0000
160 = 1010 0000

Being integers between 0 and 255, these can all be represented precisely with 8 bits. Scaling them by 1/16 is the same as dividing by 16, which is the same as shifting the bits 4 places to the right. In this case, scaling amounts to inserting a binary point between the first 4 bits and the last 4 bits of each number, which happens to match the predetermined format of this representation (4 integer bits and 4 fractional bits). Consequently, since none of these numbers requires more than 8 bits to represent as an integer, no more than 8 bits are required to scale them down and store them in the fixed-point format.
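
A final C sketch (illustrative only) repeats the earlier round trip with the scale factor 1/16: the stored byte (4 integer bits, 4 fractional bits) carries the original bit pattern, and unscaling recovers every original value exactly:

    #include <stdio.h>

    /* Round trip of the same data set with the power-of-two scale factor
       1/16: scale, truncate to 4 fractional bits, store, unscale. Because
       the scaling is a 4-bit shift, every value is recovered exactly. */
    int main(void)
    {
        const int data[] = { 154, 101, 54, 3, 0, 160 };

        for (int i = 0; i < 6; i++) {
            double scaled    = data[i] / 16.0;           /* apply scale factor 1/16      */
            int    stored    = (int)(scaled * 16.0);     /* same bit pattern as data[i]  */
            double recovered = (stored / 16.0) * 16.0;   /* unscale: exact               */
            printf("%d -> %g -> %g\n", data[i], scaled, recovered);
        }
        return 0;
    }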

See also

Arbitrary-precision arithmetic
Binary-coded decimal
Computer number format
Decimal floating point
Double-precision floating-point format
Extended precision
Fixed-point arithmetic
Floating-point arithmetic
Hexadecimal floating point
IEEE 754
Integer (computer science)
Minifloat
Power of two
Q notation
Quadruple-precision floating-point format
Signed number representations
Significand
Single-precision floating-point format
Two's complement
Unum

References

  1. Linz & Wang 2003, pp. 12–13.
  2. Linz & Wang 2003, pp. 14–15.
  3. Yates 2013, p. 6.
  4. Yates 2013, pp. 4–5.
  5. Linz & Wang 2003, p. 18.
  6. "Binary Fractions". Floating-point-gui.de. Retrieved 6 July 2020.