Saturation arithmetic is a version of arithmetic in which all operations, such as addition and multiplication, are limited to a fixed range between a minimum and maximum value.
If the result of an operation is greater than the maximum, it is set ("clamped") to the maximum; if it is below the minimum, it is clamped to the minimum. The name comes from how the value becomes "saturated" once it reaches the extreme values; further additions to a maximum or subtractions from a minimum will not change the result.
For example, if the valid range of values is from −100 to 100, the following saturating arithmetic operations produce the following values:

60 + 30 → 90
60 + 43 → 100 (not the expected 103)
(60 + 43) − (75 + 75) → 0 (not the expected −47, since both sums first saturate to 100)
10 × 11 → 100 (not the expected 110)
99 × 99 → 100 (not the expected 9801)
30 × (5 − 1) → 100 (not the expected 120)
(30 × 5) − (30 × 1) → 70 (not the expected 120, since 30 × 5 first saturates to 100)
Here is another example for saturating subtraction when the valid range is from 0 to 100 instead:

64 − 91 → 0 (not the expected −27)
As can be seen from these examples, familiar properties like associativity and distributivity may fail in saturation arithmetic. This makes it unpleasant to deal with in abstract mathematics, but it has an important role to play in digital hardware and algorithms where values have maximum and minimum representable ranges.
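For illustration, here is a minimal C sketch of saturating addition over this example range; the function name sat_add_range and the hard-coded bounds are purely illustrative, and the operands are assumed small enough that the intermediate sum cannot itself overflow an int:

/* Saturating addition clamped to the illustrative range [-100, 100]. */
int sat_add_range(int a, int b)
{
    int sum = a + b;              /* assumes a and b are in range, so no int overflow */
    if (sum > 100)  return 100;   /* clamp to the maximum */
    if (sum < -100) return -100;  /* clamp to the minimum */
    return sum;
}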
Typically, general-purpose microprocessors do not implement integer arithmetic operations using saturation arithmetic; instead, they use the easier-to-implement modular arithmetic, in which values exceeding the maximum value "wrap around" to the minimum value, like the hours on a clock passing from 12 to 1. In hardware, modular arithmetic with a minimum of zero and a maximum of rⁿ − 1, where r is the radix, can be implemented by simply discarding all but the lowest n digits. For binary hardware, which makes up the vast majority of modern hardware, the radix is 2 and the digits are bits.
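A short C sketch makes this concrete: converting to a narrower unsigned type keeps only the lowest n bits, which is exactly modular arithmetic (the function name wrap_add8 is illustrative):

#include <stdint.h>

/* Modular (wrap-around) 8-bit addition: keeping only the lowest
   8 bits reduces the result modulo 2^8 = 256. */
uint8_t wrap_add8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);      /* e.g. 250 + 10 wraps around to 4 */
}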
However, although more difficult to implement, saturation arithmetic has numerous practical advantages. The result is as numerically close to the true answer as possible; for 8-bit binary signed arithmetic, when the correct answer is 130, it is considerably less surprising to get an answer of 127 from saturating arithmetic than to get an answer of −126 from modular arithmetic. Likewise, for 8-bit binary unsigned arithmetic, when the correct answer is 258, it is less surprising to get an answer of 255 from saturating arithmetic than to get an answer of 2 from modular arithmetic.
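A sketch of this saturating behaviour for 8-bit signed values, using the __builtin_add_overflow intrinsic provided by GCC and Clang (the function name sat_add8 is illustrative):

#include <stdint.h>

/* Saturating 8-bit signed addition. On overflow the operands must
   share a sign, so the sign of a determines which bound to return. */
int8_t sat_add8(int8_t a, int8_t b)
{
    int8_t sum;
    if (__builtin_add_overflow(a, b, &sum))
        return a > 0 ? INT8_MAX : INT8_MIN;
    return sum;                   /* e.g. 100 + 30 yields 127, not -126 */
}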
Saturation arithmetic also enables overflow of additions and multiplications to be detected consistently without an overflow bit or excessive computation, by simple comparison with the maximum or minimum value (provided the datum is not permitted to take on these values).
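Assuming the sat_add8 sketch above, and the convention that legitimate data never takes on the extreme values, the detection reduces to a comparison:

#include <stdbool.h>
#include <stdint.h>

/* Detects overflow after the fact, provided INT8_MAX and INT8_MIN are
   reserved as saturation markers and never occur as ordinary data.
   Relies on sat_add8 as sketched above. */
bool add8_overflowed(int8_t a, int8_t b)
{
    int8_t r = sat_add8(a, b);
    return r == INT8_MAX || r == INT8_MIN;
}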
Additionally, saturation arithmetic enables efficient algorithms for many problems, particularly in digital signal processing. For example, adjusting the volume level of a sound signal can result in overflow, and saturation causes significantly less distortion to the sound than wrap-around. In the words of researchers G. A. Constantinides et al.: [1]
When adding two numbers using two's complement representation, overflow results in a "wrap-around" phenomenon. The result can be a catastrophic loss in signal-to-noise ratio in a DSP system. Signals in DSP designs are therefore usually either scaled appropriately to avoid overflow for all but the most extreme input vectors, or produced using saturation arithmetic components.
Saturation arithmetic operations are available on many modern platforms; in particular, they were among the extensions Intel made in the MMX platform, specifically for such signal-processing applications. This functionality is also available in wider versions in the SSE2 and AVX2 integer instruction sets, and in the ARM NEON instruction set.
Saturation arithmetic for integers has also been implemented in software for a number of programming languages, including C and C++ (for example, with compiler support in the GNU Compiler Collection), [2] as well as LLVM IR and Eiffel. This helps programmers better anticipate and understand the effects of overflow and, in the case of compilers, usually allows the compiler to pick the optimal solution.
Saturation is challenging to implement efficiently in software on a machine with only modular arithmetic operations, since simple implementations require branches that create huge pipeline delays. However, it is possible to implement saturating addition and subtraction in software without branches, using only modular arithmetic and bitwise logical operations that are available on all modern CPUs and their predecessors, including all x86 CPUs (back to the original Intel 8086) and some popular 8-bit CPUs (some of which, such as the Zilog Z80, are still in production). On the other hand, on simple 8-bit and 16-bit CPUs, a branching algorithm might actually be faster if programmed in assembly, since there are no pipelines to stall, and each instruction always takes multiple clock cycles. On the x86, which provides overflow flags and conditional moves, very simple branch-free code is possible. [3]
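The following is one branch-free formulation for 32-bit signed addition, in the spirit of the technique described above, using only wrapping addition and bitwise operations; it is a sketch, not the code from the cited reference:

#include <stdint.h>

/* Branch-free saturating 32-bit signed addition. */
int32_t sat_add32(int32_t a, int32_t b)
{
    uint32_t ua  = (uint32_t)a;
    uint32_t ub  = (uint32_t)b;
    uint32_t sum = ua + ub;                          /* wrapping add */

    /* Signed overflow occurred iff a and b share a sign bit that the
       wrapped sum does not; "ovf" becomes 0 or 1. */
    uint32_t ovf = (~(ua ^ ub) & (ua ^ sum)) >> 31;

    /* The saturated value is INT32_MAX when a >= 0 and INT32_MIN when
       a < 0: adding a's sign bit to 0x7FFFFFFF yields 0x80000000. */
    uint32_t sat = (ua >> 31) + (uint32_t)INT32_MAX;

    /* Mask-select: an all-ones mask keeps the wrapped sum, a zero mask
       substitutes the saturated value. The final cast assumes the
       usual two's-complement conversion. */
    uint32_t keep = ovf - 1;                         /* 0xFFFFFFFF or 0 */
    return (int32_t)((sum & keep) | (sat & ~keep));
}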
Although saturation arithmetic is less popular for integer arithmetic in hardware, the IEEE floating-point standard, the most popular abstraction for dealing with approximate real numbers, uses a form of saturation in which overflow is converted into "infinity" or "negative infinity", and any other operation on this result continues to produce the same value. This has the advantage over simple saturation that later operations decreasing the value will not end up producing a misleadingly "reasonable" result: once a value has overflowed to infinity, subsequent subtractions leave it recognizably infinite instead of yielding a plausible-looking finite number. Alternatively, there may be special states such as "exponent overflow" (and "exponent underflow") that similarly persist through subsequent operations, or cause immediate termination, or can be tested for, as in IF ACCUMULATOR OVERFLOW ... in FORTRAN for the IBM 704 (October 1956).
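The persistence of infinities can be seen in a few lines of C (assuming standard IEEE 754 double precision, as on virtually all current hardware):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1e308 * 10;            /* overflows to +infinity */
    double y = x - 1e308;             /* still +infinity, not a
                                         plausible-looking finite value */
    printf("%f %d\n", y, isinf(y));   /* prints "inf 1" */
    return 0;
}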
In computing and electronic systems, binary-coded decimal (BCD) is a class of binary encodings of decimal numbers where each digit is represented by a fixed number of bits, usually four or eight. Sometimes, special bit patterns are used for a sign or other indications.
A central processing unit (CPU)—also called a central processor or main processor—is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).
In computer science, an integer is a datum of integral data type, a data type that represents some range of mathematical integers. Integral data types may be of different sizes and may or may not be allowed to contain negative values. Integers are commonly represented in a computer as a group of binary digits (bits). The size of the grouping varies, so the set of integer sizes available varies between different types of computers. Computer hardware nearly always provides a way to represent a processor register or memory address as an integer.
MIPS is a family of reduced instruction set computer (RISC) instruction set architectures (ISA) developed by MIPS Computer Systems, now MIPS Technologies, based in the United States.
In computer programming, the exclusive or swap is an algorithm that uses the exclusive or bitwise operation to swap the values of two variables without using the temporary variable which is normally required.
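A minimal C sketch of the algorithm (note the aliasing guard: without it, the first XOR would zero both values when the two operands are the same object):

/* Swap two values in place using exclusive or, with no temporary. */
void xor_swap(unsigned *a, unsigned *b)
{
    if (a != b) {     /* if a and b alias, *a ^= *b would zero both */
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}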
Two's complement is the most common method of representing signed integers on computers, and more generally, fixed-point binary values. Two's complement uses the binary digit with the greatest place value as the sign to indicate whether the binary number is positive or negative. When the most significant bit is 1, the number is negative; when the most significant bit is 0, the number is positive. To negate a number in two's complement, one inverts all of its bits and adds one; the same operation reversibly converts between a value and its negation.
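The negation rule (invert all bits, then add one) is a one-liner in C; the function name is illustrative:

#include <stdint.h>

/* Two's-complement negation: for 8 bits, 5 = 00000101 inverts to
   11111010, and adding 1 gives 11111011 = -5. */
uint8_t negate8(uint8_t x)
{
    return (uint8_t)(~x + 1u);
}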
The IEEE Standard for Floating-Point Arithmetic is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.
In computing, fixed-point is a method of representing fractional (non-integer) numbers by storing a fixed number of digits of their fractional part. Dollar amounts, for example, are often stored with exactly two fractional digits, representing the cents. More generally, the term may refer to representing fractional values as integer multiples of some fixed small unit, e.g. a fractional amount of hours as an integer multiple of ten-minute intervals. Fixed-point number representation is often contrasted to the more complicated and computationally demanding floating-point representation.
In computing, signed number representations are required to encode negative numbers in binary number systems.
In computer science, arbitrary-precision arithmetic, also called bignum arithmetic, multiple-precision arithmetic, or sometimes infinite-precision arithmetic, indicates that calculations are performed on numbers whose digits of precision are limited only by the available memory of the host system. This contrasts with the faster fixed-precision arithmetic found in most arithmetic logic unit (ALU) hardware, which typically offers between 8 and 64 bits of precision.
The Hamming weight of a string is the number of symbols that are different from the zero-symbol of the alphabet used. It is thus equivalent to the Hamming distance from the all-zero string of the same length. For the most typical case, a string of bits, this is the number of 1's in the string, or the digit sum of the binary representation of a given number and the ℓ₁ norm of a bit vector. In this binary case, it is also called the population count, popcount, sideways sum, or bit summation.
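A classic branch-free C implementation of the population count for a 32-bit word (many compilers also expose this directly as a built-in or a single instruction):

#include <stdint.h>

/* Population count via parallel (SWAR) summation: add adjacent bit
   pairs, then nibbles, then bytes, and finally gather the byte sums. */
uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                   /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);   /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                   /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                     /* total      */
}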
In computer architecture, 128-bit integers, memory addresses, or other data units are those that are 128 bits wide. Also, 128-bit central processing unit (CPU) and arithmetic logic unit (ALU) architectures are those that are based on registers, address buses, or data buses of that size.
In computer programming, an integer overflow occurs when an arithmetic operation attempts to create a numeric value that is outside of the range that can be represented with a given number of digits – either higher than the maximum or lower than the minimum representable value.
Many protocols and algorithms require the serialization or enumeration of related entities. For example, a communication protocol must know whether some packet comes "before" or "after" some other packet. The IETF RFC 1982 attempts to define "serial number arithmetic" for the purposes of manipulating and comparing these sequence numbers. In short, when the absolute serial number value decreases by more than half of the maximum value, it is considered to be "after" the former, whereas other decreases are considered to be "before".
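For 32-bit sequence numbers, this comparison is often implemented with a wrapped subtraction reinterpreted as signed; this is a common idiom rather than the letter of RFC 1982, which leaves the exact-halfway case undefined:

#include <stdbool.h>
#include <stdint.h>

/* True if serial number s1 precedes s2: the modular difference,
   viewed as signed, is negative exactly when s2 is less than half
   the number space "ahead of" s1. */
bool serial_lt(uint32_t s1, uint32_t s2)
{
    return (int32_t)(s1 - s2) < 0;
}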
In computer processors the carry flag is a single bit in a system status register/flag register used to indicate when an arithmetic carry or borrow has been generated out of the most significant arithmetic logic unit (ALU) bit position. The carry flag enables numbers larger than a single ALU width to be added/subtracted by carrying (adding) a binary digit from a partial addition/subtraction to the least significant bit position of a more significant word. This is typically programmed by the user of the processor on the assembly or machine code level, but can also happen internally in certain processors, via digital logic or microcode, where some processors have wider registers and arithmetic instructions than the ALU. It is also used to extend bit shifts and rotates in a similar manner on many processors. For subtractive operations, two (opposite) conventions are employed, as most machines set the carry flag on borrow while some machines instead reset the carry flag on borrow.
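In C, which exposes no carry flag, the same idea appears as detecting wrap-around by comparison. A sketch of adding two 128-bit numbers held as two 64-bit limbs each (the layout and function name are illustrative):

#include <stdint.h>

/* Add b into a, where each number is two 64-bit limbs, index 0 low.
   A wrapped low-limb sum is smaller than either operand, which is
   exactly the condition the hardware carry flag records. */
void add128(uint64_t a[2], const uint64_t b[2])
{
    uint64_t low = a[0] + b[0];     /* wrapping add of the low limbs */
    uint64_t carry = low < a[0];    /* 1 if the addition wrapped     */
    a[0] = low;
    a[1] = a[1] + b[1] + carry;     /* propagate into the high limb  */
}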
Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.
Kochanski multiplication is an algorithm that allows modular arithmetic to be performed efficiently when the modulus is large. This has particular application in number theory and in cryptography: for example, in the RSA cryptosystem and Diffie–Hellman key exchange.
Offset binary, also referred to as excess-K, excess-N, excess-e, excess code or biased representation, is a method for signed number representation where a signed number n is represented by the bit pattern corresponding to the unsigned number n+K, K being the biasing value or offset. There is no standard for offset binary, but most often the K for an n-bit binary word is K = 2ⁿ⁻¹ (for example, the offset for a four-digit binary number would be 2³ = 8). This has the consequence that the minimal negative value is represented by all-zeros, the "zero" value is represented by a 1 in the most significant bit and zero in all other bits, and the maximal positive value is represented by all-ones (conveniently, this is the same as using two's complement but with the most significant bit inverted). It also has the consequence that in a logical comparison operation, one gets the same result as with a true form numerical comparison operation, whereas, in two's complement notation a logical comparison will agree with true form numerical comparison operation if and only if the numbers being compared have the same sign. Otherwise the sense of the comparison will be inverted, with all negative values being taken as being larger than all positive values.
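With the usual bias K = 2ⁿ⁻¹, encoding and decoding an 8-bit word (K = 128) is just an addition, or equivalently a flip of the most significant bit; a small sketch with illustrative function names:

#include <stdint.h>

/* Excess-128 (offset binary) encode/decode for 8-bit words. */
uint8_t excess128_encode(int8_t v)  { return (uint8_t)(v + 128); }
int8_t  excess128_decode(uint8_t e) { return (int8_t)(e - 128);  }

/* Equivalently: offset binary is the two's-complement bit pattern
   with the most significant bit inverted. */
uint8_t excess128_encode_xor(int8_t v) { return (uint8_t)v ^ 0x80u; }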
In computing, quadruple precision is a binary floating point–based computer number format that occupies 16 bytes with precision at least twice the 53-bit double precision.
In computing, an arithmetic logic unit (ALU) is a combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating point numbers. It is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs).