Floating-point formats |
---|
IEEE 754 |
|
Other |
In C and related programming languages, long double
refers to a floating-point data type that is often more precise than double precision though the language standard only requires it to be at least as precise as double
. As with C's other floating-point types, it may not necessarily map to an IEEE format.
long double
in CThe long double
type was present in the original 1989 C standard, [1] but support was improved by the 1999 revision of the C standard, or C99, which extended the standard library to include functions operating on long double
such as sinl()
and strtold()
.
Long double constants are floating-point constants suffixed with "L" or "l" (lower-case L), e.g., 0.3333333333333333333333333333333333L or 3.1415926535897932384626433832795029L for quadruple precision. Without a suffix, the evaluation depends on FLT_EVAL_METHOD.
On the x86 architecture, most C compilers implement long double
as the 80-bit extended precision type supported by x86 hardware (generally stored as 12 or 16 bytes to maintain data structure alignment), as specified in the C99 / C11 standards (IEC 60559 floating-point arithmetic (Annex F)). An exception is Microsoft Visual C++ for x86, which makes long double
a synonym for double
. [2] The Intel C++ compiler on Microsoft Windows supports extended precision, but requires the /Qlong‑double
switch for long double
to correspond to the hardware's extended precision format. [3]
Compilers may also use long double
for the IEEE 754 quadruple-precision binary floating-point format (binary128). This is the case on HP-UX, [4] Solaris/SPARC, [5] MIPS with the 64-bit or n32 ABI, [6] 64-bit ARM (AArch64) [7] (on operating systems using the standard AAPCS calling conventions, such as Linux), and z/OS with FLOAT(IEEE) [8] [9] [10] . Most implementations are in software, but some processors have hardware support.
On some PowerPC systems, [11] long double
is implemented as a double-double arithmetic, where a long double
value is regarded as the exact sum of two double-precision values, giving at least a 106-bit precision; with such a format, the long double
type does not conform to the IEEE floating-point standard. Otherwise, long double
is simply a synonym for double
(double precision), e.g. on 32-bit ARM, [12] 64-bit ARM (AArch64) (on Windows [13] and macOS [14] ) and on 32-bit MIPS [15] (old ABI, a.k.a. o32).
With the GNU C Compiler, long double
is 80-bit extended precision on x86 processors regardless of the physical storage used for the type (which can be either 96 or 128 bits), [16] On some other architectures, long double
can be double-double (e.g. on PowerPC [17] [18] [19] ) or 128-bit quadruple precision (e.g. on SPARC [20] ). As of gcc 4.3, a quadruple precision is also supported on x86, but as the nonstandard type __float128
rather than long double
. [21]
Although the x86 architecture, and specifically the x87 floating-point instructions on x86, supports 80-bit extended-precision operations, it is possible to configure the processor to automatically round operations to double (or even single) precision. Conversely, in extended-precision mode, extended precision may be used for intermediate compiler-generated calculations even when the final results are stored at a lower precision (i.e. FLT_EVAL_METHOD == 2). With gcc on Linux, 80-bit extended precision is the default; on several BSD operating systems (FreeBSD and OpenBSD), double-precision mode is the default, and long double
operations are effectively reduced to double precision. [22] (NetBSD 7.0 and later, however, defaults to 80-bit extended precision [23] ). However, it is possible to override this within an individual program via the FLDCW "floating-point load control-word" instruction. [22] On x86_64, the BSDs default to 80-bit extended precision. Microsoft Windows with Visual C++ also sets the processor in double-precision mode by default, but this can again be overridden within an individual program (e.g. by the _controlfp_s
function in Visual C++ [24] ). The Intel C++ Compiler for x86, on the other hand, enables extended-precision mode by default. [25] On IA-32 OS X, long double
is 80-bit extended precision. [26]
In CORBA (from specification of 3.0, which uses "ANSI/IEEE Standard 754-1985" as its reference), "the long double data type represents an IEEE double-extended floating-point number, which has an exponent of at least 15 bits in length and a signed fraction of at least 64 bits", with GIOP/IIOP CDR, whose floating-point types "exactly follow the IEEE standard formats for floating point numbers", marshalling this as what seems to be IEEE 754-2008 binary128 a.k.a. quadruple precision without using that name.
In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can be represented as a base-ten floating-point number:
Double-precision floating-point format is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
x86-64 is a 64-bit version of the x86 instruction set, first released in 1999. It introduced two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mode.
SSE2 is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2000. It extends the earlier SSE instruction set, and is intended to fully replace MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.
In computing, especially digital signal processing, the multiply–accumulate (MAC) or multiply-add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator ; the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator a:
C99 is an informal name for ISO/IEC 9899:1999, a past version of the C programming language standard. It extends the previous version (C90) with new features for the language and the standard library, and helps implementations make better use of available computer hardware, such as IEEE 754-1985 floating-point arithmetic, and compiler technology. The C11 version of the C programming language standard, published in 2011, replaces C99.
In computer software, in compiler theory, an intrinsic function is a function (subroutine) available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it may substitute a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function, the compiler has an intimate knowledge of an intrinsic function and can thus better integrate and optimize it for a given situation.
In computer architecture, 128-bit integers, memory addresses, or other data units are those that are 128 bits wide. Also, 128-bit central processing unit (CPU) and arithmetic logic unit (ALU) architectures are those that are based on registers, address buses, or data buses of that size.
Mathomatic is a free, portable, general-purpose computer algebra system (CAS) that can symbolically solve, simplify, combine, and compare algebraic equations, and can perform complex number, modular, and polynomial arithmetic, along with standard arithmetic. It does some symbolic calculus (derivative, extrema, Taylor series, and polynomial integration and Laplace transforms), numerical integration, and handles all elementary algebra except logarithms. Trigonometric functions can be entered and manipulated using complex exponentials, with the GNU m4 preprocessor. Not currently implemented are general functions like f(x), arbitrary-precision and interval arithmetic, and matrices.
This article describes the calling conventions used when programming x86 architecture microprocessors.
Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types using special software.
Clang (/klæŋ/) is a compiler front end for the C, C++, Objective-C, and Objective-C++ programming languages, as well as the OpenMP, OpenCL, RenderScript, CUDA, and HIP frameworks. It acts as a drop-in replacement for the GNU Compiler Collection (GCC), supporting most of its compilation flags and unofficial language extensions. It includes a static analyzer, and several code analysis tools.
Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD). They were proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.
In computing, quadruple precision is a binary floating point–based computer number format that occupies 16 bytes with precision at least twice the 53-bit double precision.
In computing, Microsoft Binary Format (MBF) is a format for floating-point numbers which was used in Microsoft's BASIC languages, including MBASIC, GW-BASIC and QuickBASIC prior to version 4.00.
Some programming languages provide a built-in (primitive) or library decimal data type to represent non-repeating decimal fractions like 0.3 and -1.17 without rounding, and to do arithmetic on them. Examples are the decimal.Decimal
type of Python, and analogous types provided by other languages.
AArch64 or ARM64 is the 64-bit extension of the ARM architecture family.
Advanced Matrix Extensions (AMX), also known as Intel Advanced Matrix Extensions, are extensions to the x86 instruction set architecture (ISA) for microprocessors from Intel and Advanced Micro Devices (AMD) designed to work on matrices to accelerate artificial intelligence (AI) / machine learning (ML) -related workloads.