In
computing
Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
, quadruple precision (or quad precision) is a binary
floating point
In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can be ...
–based
computer number format
A computer number format is the internal representation of numeric values in digital device hardware and software, such as in programmable computers and calculators. Numerical values are stored as groupings of bits, such as bytes and words. The e ...
that occupies 16 bytes (128 bits) with precision at least twice the 53-bit
double precision
Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
Flo ...
.
This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and
round-off error
A roundoff error, also called rounding error, is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic. Rounding errors are d ...
s in intermediate calculations and scratch variables.
William Kahan
William "Velvel" Morton Kahan (born June 5, 1933) is a Canadian mathematician and computer scientist, who received the Turing Award in 1989 for "''his fundamental contributions to numerical analysis''",
was named an ACM Fellow in 1994, and inducte ...
, primary architect of the original IEEE-754 floating point standard noted, "For now the
10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when
IEEE Standard 754 for Floating-Point Arithmetic was framed."
In
IEEE 754-2008
The Institute of Electrical and Electronics Engineers (IEEE) is a 501(c)(3) professional association for electronic engineering and electrical engineering (and associated disciplines) with its corporate office in New York City and its operation ...
the 128-bit base-2 format is officially referred to as binary128.
IEEE 754 quadruple-precision binary floating-point format: binary128
The IEEE 754 standard specifies a binary128 as having:
*
Sign bit
In computer science, the sign bit is a bit in a signed number representation that indicates the sign of a number. Although only signed numeric data types have a sign bit, it is invariably located in the most significant bit position, so the term ...
: 1 bit
*
Exponent
Exponentiation is a mathematical operation, written as , involving two numbers, the '' base'' and the ''exponent'' or ''power'' , and pronounced as " (raised) to the (power of) ". When is a positive integer, exponentiation corresponds to re ...
width: 15 bits
*
Significand
The significand (also mantissa or coefficient, sometimes also argument, or ambiguously fraction or characteristic) is part of a number in scientific notation or in floating-point representation, consisting of its significant digits. Depending on ...
precision
Precision, precise or precisely may refer to:
Science, and technology, and mathematics Mathematics and computing (general)
* Accuracy and precision, measurement deviation from true value and its scatter
* Significant figures, the number of digit ...
: 113 bits (112 explicitly stored)
This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.
The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros. Thus only 112 bits of the
significand
The significand (also mantissa or coefficient, sometimes also argument, or ambiguously fraction or characteristic) is part of a number in scientific notation or in floating-point representation, consisting of its significant digits. Depending on ...
appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: ). The bits are laid out as:
Exponent encoding
The quadruple-precision binary floating-point exponent is encoded using an
offset binary
Offset binary, also referred to as excess-K, excess-''N'', excess-e, excess code or biased representation, is a method for signed number representation where a signed number n is represented by the bit pattern corresponding to the unsigned numbe ...
representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard.
* E
min = 0001
16 − 3FFF
16 = −16382
* E
max = 7FFE
16 − 3FFF
16 = 16383
*
Exponent bias
In IEEE 754 Floating-point arithmetic, floating-point numbers, the exponent is biased in the biasing, engineering sense of the word – the value stored is offset from the actual value by the exponent bias, also called a biased exponent.
Biasing is ...
= 3FFF
16 = 16383
Thus, as defined by the offset binary representation, in order to get the true exponent, the offset of 16383 has to be subtracted from the stored exponent.
The stored exponents 0000
16 and 7FFF
16 are interpreted specially.
The minimum strictly positive (subnormal) value is 2
−16494 ≈ 10
−4965 and has a precision of only one bit.
The minimum positive normal value is 2
−16382 ≈ and has a precision of 113 bits, i.e. ±2
−16494 as well. The maximum representable value is ≈ .
Quadruple precision examples
These examples are given in bit ''representation'', in
hexadecimal
In mathematics and computing, the hexadecimal (also base-16 or simply hex) numeral system is a positional numeral system that represents numbers using a radix (base) of 16. Unlike the decimal system representing numbers using 10 symbols, hexa ...
,
of the floating-point value. This includes the sign, (biased) exponent, and significand.
0000 0000 0000 0000 0000 0000 0000 0001
16 = 2
−16382 × 2
−112 = 2
−16494
≈ 6.4751751194380251109244389582276465525 × 10
−4966
(smallest positive subnormal number)
0000 ffff ffff ffff ffff ffff ffff ffff
16 = 2
−16382 × (1 − 2
−112)
≈ 3.3621031431120935062626778173217519551 × 10
−4932
(largest subnormal number)
0001 0000 0000 0000 0000 0000 0000 0000
16 = 2
−16382
≈ 3.3621031431120935062626778173217526026 × 10
−4932
(smallest positive normal number)
7ffe ffff ffff ffff ffff ffff ffff ffff
16 = 2
16383 × (2 − 2
−112)
≈ 1.1897314953572317650857593266280070162 × 10
4932
(largest normal number)
3ffe ffff ffff ffff ffff ffff ffff ffff
16 = 1 − 2
−113
≈ 0.9999999999999999999999999999999999037
(largest number less than one)
3fff 0000 0000 0000 0000 0000 0000 0000
16 = 1 (one)
3fff 0000 0000 0000 0000 0000 0000 0001
16 = 1 + 2
−112
≈ 1.0000000000000000000000000000000001926
(smallest number larger than one)
c000 0000 0000 0000 0000 0000 0000 0000
16 = −2
0000 0000 0000 0000 0000 0000 0000 0000
16 = 0
8000 0000 0000 0000 0000 0000 0000 0000
16 = −0
7fff 0000 0000 0000 0000 0000 0000 0000
16 = infinity
ffff 0000 0000 0000 0000 0000 0000 0000
16 = −infinity
4000 921f b544 42d1 8469 898c c517 01b8
16 ≈ π
3ffd 5555 5555 5555 5555 5555 5555 5555
16 ≈ 1/3
By default, 1/3 rounds down like
double precision
Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
Flo ...
, because of the odd number of bits in the significand.
So the bits beyond the rounding point are
0101...
which is less than 1/2 of a
unit in the last place
In computer science and numerical analysis, unit in the last place or unit of least precision (ulp) is the spacing between two consecutive floating-point numbers, i.e., the value the least significant digit (rightmost digit) represents if it is 1 ...
.
Double-double arithmetic
A common software technique to implement nearly quadruple precision using ''pairs'' of
double-precision
Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
Flo ...
values is sometimes called double-double arithmetic.
[Yozo Hida, X. Li, and D. H. Bailey]
Quad-Double Arithmetic: Algorithms, Implementation, and Application
Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al.
Library for double-double and quad-double arithmetic
(2007).[J. R. Shewchuk]
Discrete & Computational Geometry
'' Discrete & Computational Geometry'' is a peer-reviewed mathematics journal published quarterly by Springer. Founded in 1986 by Jacob E. Goodman and Richard M. Pollack, the journal publishes articles on discrete geometry and computational geom ...
18:305–363, 1997. Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least
[ (actually 107 bits except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent has still 11 bits,] significantly lower than the 15-bit exponent of IEEE quadruple precision (a range of for double-double versus for binary128).
In particular, a double-double/quadruple-precision value ''q'' in the double-double technique is represented implicitly as a sum of two double-precision values ''x'' and ''y'', each of which supplies half of ''q'''s significand.[ That is, the pair is stored in place of ''q'', and operations on ''q'' values are transformed into equivalent (but more complicated) operations on the ''x'' and ''y'' values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general ]arbitrary-precision arithmetic
In computer science, arbitrary-precision arithmetic, also called bignum arithmetic, multiple-precision arithmetic, or sometimes infinite-precision arithmetic, indicates that calculations are performed on numbers whose digits of precision are li ...
techniques.[
Note that double-double arithmetic has the following special characteristics:
* As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the smallest number in the normalized range is narrower than double precision. The smallest number with full precision is , or . Numbers whose magnitude is smaller than 2−1021 will not have additional precision compared with double precision.
* The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than half ULP of the high-order part. If the low-order part is less than half ULP of the high-order part, significant bits (either all 0s or all 1s) are implied between the significant of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers.
* Because of the reason above, it is possible to represent values like , which is the smallest representable number greater than 1.
In addition to the double-double arithmetic, it is also possible to generate triple-double or quad-double arithmetic if higher precision is required without any higher precision floating-point library. They are represented as a sum of three (or four) double-precision values respectively. They can represent operations with at least 159/161 and 212/215 bits respectively.
A similar technique can be used to produce a double-quad arithmetic, which is represented as a sum of two quadruple-precision values. They can represent operations with at least 226 (or 227) bits.
]
Implementations
Quadruple precision is often implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is, as of 2016, less common (see "Hardware support
Hardware may refer to:
Technology Computing and electronics
* Electronic hardware, interconnected electronic components which perform analog or logic operations
** Digital electronics, electronics that operate on digital signals
*** Computer hardw ...
" below). One can use general arbitrary-precision arithmetic
In computer science, arbitrary-precision arithmetic, also called bignum arithmetic, multiple-precision arithmetic, or sometimes infinite-precision arithmetic, indicates that calculations are performed on numbers whose digits of precision are li ...
libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.
Computer-language support
A separate question is the extent to which quadruple-precision types are directly incorporated into computer programming language
A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language.
The description of a programming ...
s.
Quadruple precision is specified in Fortran by the real(real128)
(module iso_fortran_env
from Fortran 2008 must be used, the constant real128
is equal to 16 on most processors), or as real(selected_real_kind(33, 4931))
, or in a non-standard way as REAL*16
. (Quadruple-precision REAL*16
is supported by the Intel Fortran Compiler
Intel Fortran Compiler, is a group of Fortran compilers from Intel for Windows, macOS, and Linux.
Overview
The compilers generate code for IA-32 and Intel 64 processors and certain non-Intel but compatible processors, such as certain AMD process ...
and by the GNU Fortran
GNU Fortran or GFortran is the GNU Fortran compiler, which is part of the GNU Compiler Collection (GCC).
It includes full support for the Fortran#Fortran 95, Fortran 95 language, and supports large parts of the Fortran#Fortran 2003, Fortran 2003 an ...
compiler on x86
x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introd ...
, x86-64
x86-64 (also known as x64, x86_64, AMD64, and Intel 64) is a 64-bit version of the x86 instruction set, first released in 1999. It introduced two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mod ...
, and Itanium
Itanium ( ) is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). Launched in June 2001, Intel marketed the processors for enterprise servers and high-performance computin ...
architectures, for example.)
For the C programming language
''The C Programming Language'' (sometimes termed ''K&R'', after its authors' initials) is a computer programming book written by Brian Kernighan and Dennis Ritchie, the latter of whom originally designed and implemented the language, as well as ...
, ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies _Float128
as the type implementing the IEEE 754 quadruple-precision format (binary128). Alternatively, in C/C++
C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
with a few systems and compilers, quadruple precision may be specified by the long double
In C and related programming languages, long double refers to a floating-point data type that is often more precise than double precision though the language standard only requires it to be at least as precise as double. As with C's other flo ...
type, but this is not required by the language (which only requires long double
to be at least as precise as double
), nor is it common.
On x86 and x86-64, the most common C/C++ compilers implement long double
as either 80-bit extended precision
Extended precision refers to floating-point arithmetic, floating-point number formats that provide greater precision (computer science), precision than the basic floating-point formats. Extended precision formats support a basic format by floati ...
(e.g. the GNU C Compiler
The GNU Compiler Collection (GCC) is an optimizing compiler produced by the GNU Project supporting various programming languages, hardware architectures and operating systems. The Free Software Foundation (FSF) distributes GCC as free software ...
gcc and the Intel C++ Compiler
Intel oneAPI DPC++/C++ Compiler and Intel C++ Compiler Classic are Intel’s C, C++, SYCL, and Data Parallel C++ (DPC++) compilers for Intel processor-based systems, available for Windows, Linux, and macOS operating systems.
Overview
Intel o ...
with a /Qlong‑double
switch) or simply as being synonymous with double precision (e.g. Microsoft Visual C++
Microsoft Visual C++ (MSVC) is a compiler for the C, C++ and C++/CX programming languages by Microsoft. MSVC is proprietary software; it was originally a standalone product but later became a part of Visual Studio and made available in both tria ...
), rather than as quadruple precision. The procedure call standard for the ARM 64-bit architecture (AArch64) specifies that long double
corresponds to the IEEE 754 quadruple-precision format. On a few other architectures, some C/C++ compilers implement long double
as quadruple precision, e.g. gcc on PowerPC
PowerPC (with the backronym Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) is a reduced instruction set computer (RISC) instruction set architecture (ISA) created by the 1991 Apple Inc., App ...
(as double-double) and SPARC
SPARC (Scalable Processor Architecture) is a reduced instruction set computer (RISC) instruction set architecture originally developed by Sun Microsystems. Its design was strongly influenced by the experimental Berkeley RISC system developed ...
, or the Sun Studio compilers on SPARC. Even if long double
is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called __float128
for x86, x86-64 and Itanium
Itanium ( ) is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). Launched in June 2001, Intel marketed the processors for enterprise servers and high-performance computin ...
CPUs, and on PowerPC
PowerPC (with the backronym Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) is a reduced instruction set computer (RISC) instruction set architecture (ISA) created by the 1991 Apple Inc., App ...
as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options; and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called _Quad
.
Google's work-in-progress language Carbon
Carbon () is a chemical element with the symbol C and atomic number 6. It is nonmetallic and tetravalent
In chemistry, the valence (US spelling) or valency (British spelling) of an element is the measure of its combining capacity with o ...
provides support for it with the type called 'f128'.
Libraries and toolboxes
* The GCC quad-precision math library
libquadmath
provides __float128
and __complex128
operations.
* The Boost multiprecision library Boost.Multiprecision provides unified cross-platform C++ interface for __float128
and _Quad
types, and includes a custom implementation of the standard math library.
* The Multiprecision Computing Toolbox for MATLAB allows quadruple-precision computations in MATLAB
MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation ...
. It includes basic arithmetic functionality as well as numerical methods, dense and sparse linear algebra.
* The DoubleFloats package provides support for double-double computations for the Julia programming language.
* The doubledouble.py library enables double-double computations in Python.
* Mathematica supports IEEE quad-precision numbers: 128-bit floating-point values (Real128), and 256-bit complex values (Complex256).
Hardware support
IEEE quadruple precision was added to the IBM System/390
The IBM System/390 is a discontinued mainframe product family implementing the ESA/390, the fifth generation of the System/360 instruction set architecture. The first computers to use the ESA/390 were the Enterprise System/9000 (ES/9000) ...
G5 in 1998, and is supported in hardware in subsequent z/Architecture
z/Architecture, initially and briefly called ESA Modal Extensions (ESAME), is IBM's 64-bit complex instruction set computer (CISC) instruction set architecture, implemented by its mainframe computers. IBM introduced its first z/Architecture-b ...
processors. The IBM POWER9
POWER9 is a family of superscalar, multithreading, multi-core microprocessors produced by IBM, based on the Power ISA. It was announced in August 2016. The POWER9-based processors are being manufactured using a 14 nm FinFET process, in ...
CPU ( Power ISA 3.0) has native 128-bit hardware support.[
Native support of IEEE 128-bit floats is defined in ]PA-RISC
PA-RISC is an instruction set architecture (ISA) developed by Hewlett-Packard. As the name implies, it is a reduced instruction set computer (RISC) architecture, where the PA stands for Precision Architecture. The design is also referred to as ...
1.0, and in SPARC
SPARC (Scalable Processor Architecture) is a reduced instruction set computer (RISC) instruction set architecture originally developed by Sun Microsystems. Its design was strongly influenced by the experimental Berkeley RISC system developed ...
V8 and V9 architectures (e.g. there are 16 quad-precision registers %q0, %q4, ...), but no SPARC CPU implements quad-precision operations in hardware .
Non-IEEE extended-precision (128 bits of storage, 1 sign bit, 7 exponent bits, 112 fraction bits, 8 bits unused) was added to the IBM System/370
The IBM System/370 (S/370) is a model range of IBM mainframe computers announced on June 30, 1970, as the successors to the System/360 family. The series mostly maintains backward compatibility with the S/360, allowing an easy migration path f ...
series (1970s–1980s) and was available on some System/360
The IBM System/360 (S/360) is a family of mainframe computer systems that was announced by IBM on April 7, 1964, and delivered between 1965 and 1978. It was the first family of computers designed to cover both commercial and scientific applica ...
models in the 1960s (System/360-85, -195, and others by special request or simulated by OS software).
The Siemens
Siemens AG ( ) is a German multinational conglomerate corporation and the largest industrial manufacturing company in Europe headquartered in Munich with branch offices abroad.
The principal divisions of the corporation are ''Industry'', '' ...
7.700 and 7.500 series mainframes and their successors support the same floating-point formats and instructions as the IBM System/360 and System/370.
The VAX
VAX (an acronym for Virtual Address eXtension) is a series of computers featuring a 32-bit instruction set architecture (ISA) and virtual memory that was developed and sold by Digital Equipment Corporation (DEC) in the late 20th century. The V ...
processor implemented non-IEEE quadruple-precision floating point as its "H Floating-point" format. It had one sign bit, a 15-bit exponent and 112-fraction bits, however the layout in memory was significantly different from IEEE quadruple precision and the exponent bias also differed. Only a few of the earliest VAX processors implemented H Floating-point instructions in hardware, all the others emulated H Floating-point in software.
The RISC-V
RISC-V (pronounced "risk-five" where five refers to the number of generations of RISC architecture that were developed at the University of California, Berkeley since 1981) is an open standard instruction set architecture (ISA) based on estab ...
architecture specifies a "Q" (quad-precision) extension for 128-bit binary IEEE 754-2008 floating point arithmetic. The "L" extension (not yet certified) will specify 64-bit and 128-bit decimal floating point.
Chapter 15 (p. 95).
Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement Single instruction, multiple data, SIMD instructions, such as Streaming SIMD Extensions
In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of Central processing units (CPUs) ...
or AltiVec, which refers to 128-bit vectors of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.
See also
* IEEE 754
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found i ...
, IEEE standard for floating-point arithmetic
* ISO/IEC 10967
ISO/IEC 10967, Language independent arithmetic (LIA), is a series of
standards on computer arithmetic. It is compatible with ISO/IEC/IEEE 60559:2011,
more known as IEEE 754-2008, and much of the
specifications are for IEEE 754 special values
(tho ...
, Language independent arithmetic
* Primitive data type
In computer science, primitive data types are a set of basic data types from which all other data types are constructed. Specifically it often refers to the limited set of data representations in use by a particular processor, which all compiled pr ...
* Q notation (scientific notation)
Scientific notation is a way of expressing numbers that are too large or too small (usually would result in a long string of digits) to be conveniently written in decimal form. It may be referred to as scientific form or standard index form, or ...
References
External links
High-Precision Software Directory
QPFloat
a free software
Free software or libre software is computer software distributed under terms that allow users to run the software for any purpose as well as to study, change, and distribute it and any adapted versions. Free software is a matter of liberty, no ...
(GPL
The GNU General Public License (GNU GPL or simply GPL) is a series of widely used free software licenses that guarantee end users the four freedoms to run, study, share, and modify the software. The license was the first copyleft for general u ...
) software library for quadruple-precision arithmetic
HPAlib
a free software (LGPL
The GNU Lesser General Public License (LGPL) is a free-software license published by the Free Software Foundation (FSF). The license allows developers and companies to use and integrate a software component released under the LGPL into their own ...
) software library for quad-precision arithmetic
libquadmath
the GCC quad-precision math library
IEEE-754 Analysis
Interactive web page for examining Binary32, Binary64, and Binary128 floating-point values
{{data types
Binary arithmetic
Floating point types