Computer number format

A computer number format is the internal representation of numeric values in digital device hardware and software, such as in programmable computers and calculators.[1] Numerical values are stored as groupings of bits, such as bytes and words. The encoding between numerical values and bit patterns is chosen for convenience of the operation of the computer; the encoding used by the computer's instruction set generally requires conversion for external use, such as for printing and display. Different types of processors may have different internal representations of numerical values, and different conventions are used for integer and real numbers. Most calculations are carried out with number formats that fit into a processor register, but some software systems allow representation of arbitrarily large numbers using multiple words of memory.

Binary number representation

Computers represent data in sets of binary digits. The representation is composed of bits, which in turn are grouped into larger sets such as bytes.

Table 1: Binary to octal
Binary string   Octal value
000             0
001             1
010             2
011             3
100             4
101             5
110             6
111             7
Table 2: Number of values for a bit string
Length of bit string (b)   Number of possible values (N)
 1        2
 2        4
 3        8
 4       16
 5       32
 6       64
 7      128
 8      256
 9      512
10     1024
...

A bit is a binary digit that represents one of two states. The concept of a bit can be understood as a value of either 1 or 0, on or off, yes or no, true or false, or encoded by a switch or toggle of some kind.

While a single bit, on its own, is able to represent only two values, a string of bits may be used to represent larger values. For example, a string of three bits can represent up to eight distinct values as illustrated in Table 1.

As the number of bits composing a string increases, the number of possible 0 and 1 combinations increases exponentially. A single bit allows only two value-combinations, two bits combined can make four separate values, three bits give eight, and so on, increasing with the formula 2^n. The number of possible combinations doubles with each binary digit added, as illustrated in Table 2.
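To make the doubling concrete, a minimal Python sketch (illustrative only, not part of the original article) reproduces Table 2:

    # The number of distinct values representable by an n-bit string is 2**n.
    for n in range(1, 11):
        print(n, 2 ** n)   # prints 1 2, 2 4, 3 8, ..., 10 1024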

Bit strings of specific lengths are used to represent different kinds of data and have specific names.

A byte is a bit string containing the number of bits needed to represent a character. On most modern computers, this is an eight-bit string. Because the definition of a byte is related to the number of bits composing a character, some older computers have used a different bit length for their byte.[2] In many computer architectures, the byte is the smallest addressable unit, the "atom" of addressability. For example, even though 64-bit processors may address memory sixty-four bits at a time, they may still split that memory into eight-bit pieces. This is called byte-addressable memory. Historically, many CPUs read data in some multiple of eight bits.[3] Because the byte size of eight bits is so common, but the definition is not standardized, the term octet is sometimes used to explicitly describe an eight-bit sequence.

A nibble (sometimes nybble) is a number composed of four bits.[4] Being a half-byte, the nibble was named as a play on words: a person may need several nibbles for one bite from something; similarly, a nybble is a part of a byte. Because four bits allow for sixteen values, a nibble is sometimes known as a hexadecimal digit.[5]
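As an illustrative sketch (hypothetical values, not from the original article), the two nibbles of a byte can be separated with a shift and a mask in Python, and each prints as a single hexadecimal digit:

    byte = 0xB4                      # one 8-bit value, 1011 0100 in binary
    high_nibble = (byte >> 4) & 0xF  # upper four bits -> 0xB
    low_nibble = byte & 0xF          # lower four bits -> 0x4
    print(hex(high_nibble), hex(low_nibble))   # 0xb 0x4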

Octal and hexadecimal number display

Octal and hexadecimal encoding are convenient ways to represent binary numbers, as used by computers. Computer engineers often need to write out binary quantities, but in practice writing out a binary number such as 1001001101010001 is tedious and prone to errors. Therefore, binary quantities are written in a base-8 ("octal") or, much more commonly, a base-16 ("hexadecimal", or "hex") number format. In the decimal system, there are 10 digits, 0 through 9, which combine to form numbers. In an octal system, there are only 8 digits, 0 through 7. That is, the value of an octal "10" is the same as a decimal "8", an octal "20" is a decimal "16", and so on. In a hexadecimal system, there are 16 digits, 0 through 9 followed, by convention, by A through F. That is, a hexadecimal "10" is the same as a decimal "16" and a hexadecimal "20" is the same as a decimal "32". An example and comparison of numbers in different bases is given in Table 3 below.

When typing numbers, formatting characters are used to describe the number system, for example 1111_1000B or 0b1111_1000 for binary and 0F8H or 0xf8 for hexadecimal numbers (all of which denote decimal 248).
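For instance, Python uses the 0b, 0o, and 0x prefixes for binary, octal, and hexadecimal literals (a small illustrative sketch; the suffix forms such as 0F8H come from assembler conventions and are not valid Python):

    n = 0xf8                            # hexadecimal literal
    print(n)                            # 248 in decimal
    print(n == 0b1111_1000 == 0o370)    # True: the same value in binary and octal
    print(bin(n), oct(n), hex(n))       # 0b11111000 0o370 0xf8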

Converting between bases

Table 3: Comparison of values in different bases
Decimal   Binary   Octal   Hexadecimal
 0        000000   00      00
 1        000001   01      01
 2        000010   02      02
 3        000011   03      03
 4        000100   04      04
 5        000101   05      05
 6        000110   06      06
 7        000111   07      07
 8        001000   10      08
 9        001001   11      09
10        001010   12      0A
11        001011   13      0B
12        001100   14      0C
13        001101   15      0D
14        001110   16      0E
15        001111   17      0F

Each of these number systems is a positional system, but while decimal weights are powers of 10, the octal weights are powers of 8 and the hexadecimal weights are powers of 16. To convert from hexadecimal or octal to decimal, for each digit one multiplies the value of the digit by the value of its position and then adds the results. For example, hexadecimal 2BC is 2 × 256 + 11 × 16 + 12 × 1 = 700 in decimal, and octal 157 is 1 × 64 + 5 × 8 + 7 × 1 = 111 in decimal.
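The same conversions can be sketched in Python (illustrative values; int() with an explicit base performs the digit-by-digit weighting described above):

    print(int("2BC", 16))        # 2*256 + 11*16 + 12*1 = 700
    print(int("157", 8))         # 1*64  + 5*8   + 7*1  = 111
    print(hex(700), oct(111))    # 0x2bc 0o157 (converting back)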

Representing fractions in binary

Fixed-point numbers

Fixed-point formatting can be useful to represent fractions in binary.

The number of bits needed for the precision and range desired must be chosen to store the fractional and integer parts of a number. For instance, using a 32-bit format, 16 bits may be used for the integer and 16 for the fraction.

The eight's bit is followed by the four's bit, then the two's bit, then the one's bit. The fractional bits continue the pattern set by the integer bits. The next bit is the half's bit, then the quarter's bit, then the ⅛'s bit, and so on. For example:

                     integer bits     fractional bits
0.500 = 1/2     =  00000000 00000000.10000000 00000000
1.250 = 1 + 1/4 =  00000000 00000001.01000000 00000000
7.375 = 7 + 3/8 =  00000000 00000111.01100000 00000000

Some values cannot be represented exactly by this form of encoding. For example, for the fraction 1/5 (0.2 in decimal), the closest approximations are as follows:

13107 / 65536 = 00000000 00000000.00110011 00110011 = 0.1999969... in decimal
13108 / 65536 = 00000000 00000000.00110011 00110100 = 0.2000122... in decimal
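A minimal sketch of this 16.16 fixed-point encoding in Python (hypothetical helper names; a value is stored as an integer count of 1/65536 units):

    SCALE = 1 << 16   # 65536, i.e. 16 fractional bits

    def to_fixed(x):
        """Encode a real number as a 16.16 fixed-point integer."""
        return round(x * SCALE)

    def from_fixed(f):
        """Decode a 16.16 fixed-point integer back to a float."""
        return f / SCALE

    f = to_fixed(0.2)
    print(f, from_fixed(f))   # 13107 0.1999969482421875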

Even if more digits are used, an exact representation is impossible. The number 1/3, written in decimal as 0.333333333..., continues indefinitely. If prematurely terminated, the value would not represent 1/3 precisely.

Floating-point numbers

While both unsigned and signed integers are used in digital systems, even a 32-bit integer is not enough to cover the full range of numbers a calculator can handle, and that is without considering fractions. To approximate the greater range and precision of real numbers, we have to abandon signed integers and fixed-point numbers and go to a "floating-point" format.

In the decimal system, we are familiar with floating-point numbers of the form (scientific notation):

1.1030402 × 10^5 = 1.1030402 × 100000 = 110304.02

or, more compactly:

1.1030402E5

which means "1.1030402 times 1 followed by 5 zeroes". We have a certain numeric value (1.1030402) known as a "significand", multiplied by a power of 10 (E5, meaning 105 or 100,000), known as an "exponent". If we have a negative exponent, that means the number is multiplied by a 1 that many places to the right of the decimal point. For example:

2.3434E-6 = 2.3434 × 10^-6 = 2.3434 × 0.000001 = 0.0000023434

The advantage of this scheme is that by using the exponent we can get a much wider range of numbers, even if the number of digits in the significand, or the "numeric precision", is much smaller than the range. Similar binary floating-point formats can be defined for computers. There are a number of such schemes; the most popular has been defined by the Institute of Electrical and Electronics Engineers (IEEE). The IEEE 754-2008 standard specification defines a 64-bit floating-point format with an 11-bit binary exponent in "excess-1023" format (the stored exponent is the actual exponent plus 1023), a 52-bit significand, and a sign bit.

Let's see what this format looks like by showing how such a number would be stored in 8 bytes of memory:

byte 0   S     x10   x9    x8    x7    x6    x5    x4
byte 1   x3    x2    x1    x0    m51   m50   m49   m48
byte 2   m47   m46   m45   m44   m43   m42   m41   m40
byte 3   m39   m38   m37   m36   m35   m34   m33   m32
byte 4   m31   m30   m29   m28   m27   m26   m25   m24
byte 5   m23   m22   m21   m20   m19   m18   m17   m16
byte 6   m15   m14   m13   m12   m11   m10   m9    m8
byte 7   m7    m6    m5    m4    m3    m2    m1    m0

where "S" denotes the sign bit, "x" denotes an exponent bit, and "m" denotes a significand bit. Once the bits here have been extracted, they are converted with the computation:

<sign> × (1 + <fractional significand>) × 2^(<exponent> - 1023)

This scheme provides numbers valid out to about 15 decimal digits, with the following range of numbers:

           maximum                      minimum
positive   1.797693134862316E+308       4.940656458412465E-324
negative   -4.940656458412465E-324      -1.797693134862316E+308
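As an illustrative sketch (using Python's standard struct module; not part of the original article), the bits of a 64-bit value can be extracted and fed through the formula above:

    import struct

    def decode_double(x):
        # Reinterpret the 8 bytes of a double as a raw 64-bit integer.
        (bits,) = struct.unpack(">Q", struct.pack(">d", x))
        sign = -1 if bits >> 63 else 1
        exponent = (bits >> 52) & 0x7FF        # 11 exponent bits, excess-1023
        significand = bits & ((1 << 52) - 1)   # 52 significand bits
        # Rebuild the value (normalized numbers only, i.e. exponent not 0 or 2047).
        return sign * (1 + significand / 2**52) * 2.0**(exponent - 1023)

    print(decode_double(110304.02))   # 110304.02
    print(decode_double(-0.1))        # -0.1

The same approach works for the 32-bit format described below, using the ">f" and ">I" struct codes, an 8-bit exponent field, a 23-bit significand, and a bias of 127.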

The specification also defines several special values that are not defined numbers and are known as NaNs, for "Not A Number". These are used by programs to designate invalid operations and the like.

Some programs also use 32-bit floating-point numbers. The most common scheme uses a 23-bit significand with a sign bit, plus an 8-bit exponent in "excess-127" format, giving seven valid decimal digits.

byte 0   S     x7    x6    x5    x4    x3    x2    x1
byte 1   x0    m22   m21   m20   m19   m18   m17   m16
byte 2   m15   m14   m13   m12   m11   m10   m9    m8
byte 3   m7    m6    m5    m4    m3    m2    m1    m0

The bits are converted to a numeric value with the computation:

<sign> × (1 + <fractional significand>) × 2^(<exponent> - 127)

leading to the following range of numbers:

           maximum           minimum
positive   3.402823E+38      1.401298E-45
negative   -1.401298E-45     -3.402823E+38

Such floating-point numbers are known as "reals" or "floats" in general, but with a number of variations:

A 32-bit float value is sometimes called a "real32" or a "single", meaning "single-precision floating-point value".

A 64-bit float is sometimes called a "real64" or a "double", meaning "double-precision floating-point value".

The relation between numbers and bit patterns is chosen for convenience in computer manipulation; eight bytes stored in computer memory may represent a 64-bit real, two 32-bit reals, or four signed or unsigned integers, or some other kind of data that fits into eight bytes. The only difference is how the computer interprets them. If the computer stored four unsigned integers and then read them back from memory as a 64-bit real, it almost always would be a perfectly valid real number, though it would be junk data.

Only a finite range of real numbers can be represented with a given number of bits. Arithmetic operations can overflow or underflow, producing a value too large or too small to be represented.

The representation has a limited precision. For example, only 15 decimal digits can be represented with a 64-bit real. If a very small floating-point number is added to a large one, the result is just the large one. The small number was too small to even show up in 15 or 16 digits of resolution, and the computer effectively discards it. Analyzing the effect of limited precision is a well-studied problem. Estimates of the magnitude of round-off errors and methods to limit their effect on large calculations are part of any large computation project. The precision limit is different from the range limit, as it affects the significand, not the exponent.
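A short Python sketch of this absorption effect (values chosen only for illustration):

    big = 1.0e16
    small = 1.0
    print(big + small == big)   # True: 1.0 is below the resolution of a double near 1e16
    print((big + small) - big)  # 0.0: the small addend was effectively discarded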

The significand is a binary fraction that doesn't necessarily perfectly match a decimal fraction. In many cases a sum of reciprocal powers of 2 does not match a specific decimal fraction, and the results of computations will be slightly off. For example, the decimal fraction "0.1" is equivalent to an infinitely repeating binary fraction: 0.000110011 ... [6]
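This can be seen directly in Python (a hedged sketch using the standard decimal module to display the exact value actually stored):

    from decimal import Decimal

    print(0.1 + 0.2 == 0.3)   # False: each operand is only a nearby binary fraction
    print(Decimal(0.1))       # 0.1000000000000000055511151231257827021181583404541015625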

Numbers in programming languages

Programming in assembly language requires the programmer to keep track of the representation of numbers. Where the processor does not support a required mathematical operation, the programmer must work out a suitable algorithm and instruction sequence to carry out the operation; on some microprocessors, even integer multiplication must be done in software.

High-level programming languages such as Ruby and Python offer an abstract number that may be an expanded type such as rational, bignum, or complex. Mathematical operations are carried out by library routines provided by the implementation of the language. A given mathematical symbol in the source code, by operator overloading, will invoke different object code appropriate to the representation of the numerical type; mathematical operations on any number—whether signed, unsigned, rational, floating-point, fixed-point, integral, or complex—are written exactly the same way.
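An illustrative Python sketch (using the standard fractions module) of how the same operators apply to integer, rational, and complex values alike:

    from fractions import Fraction

    print(2 ** 100)                          # arbitrary-precision integer (bignum)
    print(Fraction(1, 3) + Fraction(1, 6))   # exact rational arithmetic: 1/2
    print((1 + 2j) * (3 - 1j))               # complex arithmetic: (5+5j)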

Some languages, such as REXX and Java, provide decimal floating-point operations, which exhibit rounding errors of a different form.
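Python's standard decimal module provides a comparable decimal floating-point facility (shown here as an analogous sketch, not the REXX or Java API):

    from decimal import Decimal

    # Decimal fractions that are inexact in binary are represented exactly here.
    print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True
    print(Decimal(1) / Decimal(3))   # 0.3333333333333333333333333333 (rounded in base 10)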

Notes and references

The initial version of this article was based on a public domain article from Greg Goebel's Vectorsite.

  1. Jon Stokes (2007). Inside the machine: an illustrated introduction to microprocessors and computer architecture. No Starch Press. p. 66. ISBN 978-1-59327-104-6.
  2. "byte definition". Retrieved 24 April 2012.
  3. "Microprocessor and CPU (Central Processing Unit)". Network Dictionary. Archived from the original on 3 October 2017. Retrieved 1 May 2012.
  4. "nybble definition". Retrieved 3 May 2012.
  5. "Nybble". TechTerms.com. Retrieved 3 May 2012.
  6. Goebel, Greg. "Computer Numbering Format". Retrieved 10 September 2012.
