computing Computing is any goal-oriented activity requiring, benefiting from, or creating computer, computing machinery. It includes the study and experimentation of algorithmic processes, and the development of both computer hardware, hardware and softw ...

, quadruple precision (or quad precision) is a binary

floating-point In computing, floating-point arithmetic (FP) is arithmetic on subsets of real numbers formed by a ''significand'' (a Sign (mathematics), signed sequence of a fixed number of digits in some Radix, base) multiplied by an integer power of that ba ...

–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit

double precision Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point arithmetic, floating-point computer number format, number format, usually occupying 64 Bit, bits in computer memory; it represents a wide range of numeri ...

. This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed." In

IEEE 754-2008 The Institute of Electrical and Electronics Engineers (IEEE) is an American 501(c)(3) public charity professional organization for electrical engineering, electronics engineering, and other related disciplines. The IEEE has a corporate office ...

the 128-bit base-2 format is officially referred to as binary128.

IEEE 754 quadruple-precision binary floating-point format: binary128

The IEEE 754 standard specifies a binary128 as having: * Sign bit: 1 bit * Exponent width: 15 bits * Significand precision: 113 bits (112 explicitly stored) The sign bit determines the sign of the number (including when this number is zero, which is signed). "1" stands for negative. This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number. The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros (used to encode subnormal numbers and zeros). Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: ) for normal values; subnormals have gracefully degrading precision down to 1 bit for the smallest non-zero value. The bits are laid out as: IEEE 754 Quadruple Floating Point Format

IEEE 754 Quadruple Floating Point Format

Exponent encoding

The quadruple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard. * E_min = 0001₁₆ − 3FFF₁₆ = −16382 * E_max = 7FFE₁₆ − 3FFF₁₆ = 16383 * Exponent bias = 3FFF₁₆ = 16383 Thus, as defined by the offset binary representation, in order to get the true exponent, the offset of 16383 has to be subtracted from the stored exponent. The stored exponents 0000₁₆ and 7FFF₁₆ are interpreted specially. The minimum strictly positive (subnormal) value is 2⁻¹⁶⁴⁹⁴ ≈ 10⁻⁴⁹⁶⁵ and has a precision of only one bit. The minimum positive normal value is 2⁻¹⁶³⁸² ≈ and has a precision of 113 bits, i.e. ±2⁻¹⁶⁴⁹⁴ as well. The maximum representable value is ≈ .

Quadruple precision examples

These examples are given in bit ''representation'', in

hexadecimal Hexadecimal (also known as base-16 or simply hex) is a Numeral system#Positional systems in detail, positional numeral system that represents numbers using a radix (base) of sixteen. Unlike the decimal system representing numbers using ten symbo ...

, of the floating-point value. This includes the sign, (biased) exponent, and significand. 0000 0000 0000 0000 0000 0000 0000 0001₁₆ = 2⁻¹⁶³⁸² × 2⁻¹¹² = 2⁻¹⁶⁴⁹⁴ ≈ 6.4751751194380251109244389582276465525 × 10⁻⁴⁹⁶⁶ (smallest positive subnormal number) 0000 ffff ffff ffff ffff ffff ffff ffff₁₆ = 2⁻¹⁶³⁸² × (1 − 2⁻¹¹²) ≈ 3.3621031431120935062626778173217519551 × 10⁻⁴⁹³² (largest subnormal number) 0001 0000 0000 0000 0000 0000 0000 0000₁₆ = 2⁻¹⁶³⁸² ≈ 3.3621031431120935062626778173217526026 × 10⁻⁴⁹³² (smallest positive normal number) 7ffe ffff ffff ffff ffff ffff ffff ffff₁₆ = 2¹⁶³⁸³ × (2 − 2⁻¹¹²) ≈ 1.1897314953572317650857593266280070162 × 10⁴⁹³² (largest normal number) 3ffe ffff ffff ffff ffff ffff ffff ffff₁₆ = 1 − 2⁻¹¹³ ≈ 0.9999999999999999999999999999999999037 (largest number less than one) 3fff 0000 0000 0000 0000 0000 0000 0000₁₆ = 1 (one) 3fff 0000 0000 0000 0000 0000 0000 0001₁₆ = 1 + 2⁻¹¹² ≈ 1.0000000000000000000000000000000001926 (smallest number larger than one) 4000 0000 0000 0000 0000 0000 0000 0000₁₆ = 2 c000 0000 0000 0000 0000 0000 0000 0000₁₆ = −2 0000 0000 0000 0000 0000 0000 0000 0000₁₆ = 0 8000 0000 0000 0000 0000 0000 0000 0000₁₆ = −0 7fff 0000 0000 0000 0000 0000 0000 0000₁₆ = infinity ffff 0000 0000 0000 0000 0000 0000 0000₁₆ = −infinity 4000 921f b544 42d1 8469 898c c517 01b8₁₆ ≈ 3.1415926535897932384626433832795027975 (closest approximation to π) 3ffd 5555 5555 5555 5555 5555 5555 5555₁₆ ≈ 0.3333333333333333333333333333333333173 (closest approximation to 1/3) By default, 1/3 rounds down like

, because of the odd number of bits in the significand. Thus, the bits beyond the rounding point are 0101... which is less than 1/2 of a unit in the last place.

Double-double arithmetic

A common software technique to implement nearly quadruple precision using ''pairs'' of double-precision values is sometimes called double-double arithmetic.Yozo Hida, X. Li, and D. H. Bailey
Quad-Double Arithmetic: Algorithms, Implementation, and Application
Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al.
Library for double-double and quad-double arithmetic
(2007).J. R. Shewchuk

Discrete & Computational Geometry 18: 305–363, 1997. Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least (actually 107 bits except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent has still 11 bits, significantly lower than the 15-bit exponent of IEEE quadruple precision (a range of for double-double versus for binary128). In particular, a double-double/quadruple-precision value ''q'' in the double-double technique is represented implicitly as a sum of two double-precision values ''x'' and ''y'', each of which supplies half of ''q'''s significand. That is, the pair is stored in place of ''q'', and operations on ''q'' values are transformed into equivalent (but more complicated) operations on the ''x'' and ''y'' values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general arbitrary-precision arithmetic techniques. Note that double-double arithmetic has the following special characteristics: * As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the smallest number in the normalized range is narrower than double precision. The smallest number with full precision is , or . Numbers whose magnitude is smaller than 2⁻¹⁰²¹ will not have additional precision compared with double precision. * The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than a half ULP of the high-order part. If the low-order part is less than half ULP of the high-order part, significant bits (either all 0s or all 1s) are implied between the significand of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers. * Because of the reason above, it is possible to represent values like , which is the smallest representable number greater than 1. In addition to the double-double arithmetic, it is also possible to generate triple-double or quad-double arithmetic if higher precision is required without any higher precision floating-point library. They are represented as a sum of three (or four) double-precision values respectively. They can represent operations with at least 159/161 and 212/215 bits respectively. A similar technique can be used to produce a double-quad arithmetic, which is represented as a sum of two quadruple-precision values. They can represent operations with at least 226 (or 227) bits.

Implementations

Quadruple precision is often implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is, , less common (see " Hardware support" below). One can use general arbitrary-precision arithmetic libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.

Computer-language support

A separate question is the extent to which quadruple-precision types are directly incorporated into computer

programming language A programming language is a system of notation for writing computer programs. Programming languages are described in terms of their Syntax (programming languages), syntax (form) and semantics (computer science), semantics (meaning), usually def ...

s. Quadruple precision is specified in Fortran by the real(real128) (module iso_fortran_env from Fortran 2008 must be used, the constant real128 is equal to 16 on most processors), or as real(selected_real_kind(33, 4931)), or in a non-standard way as REAL*16. (Quadruple-precision REAL*16 is supported by the Intel Fortran Compiler and by the GNU Fortran compiler on x86,

x86-64 x86-64 (also known as x64, x86_64, AMD64, and Intel 64) is a 64-bit extension of the x86 instruction set architecture, instruction set. It was announced in 1999 and first available in the AMD Opteron family in 2003. It introduces two new ope ...

, and Itanium architectures, for example.) For the C programming language, ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies _Float128 as the type implementing the IEEE 754 quadruple-precision format (binary128). Alternatively, in C/ C++ with a few systems and compilers, quadruple precision may be specified by the long double type, but this is not required by the language (which only requires long double to be at least as precise as double), nor is it common. On x86 and x86-64, the most common C/C++ compilers implement long double as either 80-bit extended precision (e.g. the GNU C Compiler gcc and the Intel C++ Compiler with a /Qlong‑double switch) or simply as being synonymous with double precision (e.g. Microsoft Visual C++), rather than as quadruple precision. The procedure call standard for the ARM 64-bit architecture (AArch64) specifies that long double corresponds to the IEEE 754 quadruple-precision format. On a few other architectures, some C/C++ compilers implement long double as quadruple precision, e.g. gcc on

PowerPC PowerPC (with the backronym Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) is a reduced instruction set computer (RISC) instruction set architecture (ISA) created by the 1991 Apple Inc., App ...

(as double-double) and SPARC, or the Sun Studio compilers on SPARC. Even if long double is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called __float128 for x86, x86-64 and Itanium CPUs, and on

as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options; and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called _Quad. Zig provides support for it with its f128 type. Google's work-in-progress language

Carbon Carbon () is a chemical element; it has chemical symbol, symbol C and atomic number 6. It is nonmetallic and tetravalence, tetravalent—meaning that its atoms are able to form up to four covalent bonds due to its valence shell exhibiting 4 ...

provides support for it with the type called f128. As of 2024, Rust is currently working on adding a new f128 type for IEEE quadruple-precision 128-bit floats.

Libraries and toolboxes

* The GCC quad-precision math library
libquadmath
provides __float128 and __complex128 operations. * The Boost multiprecision library Boost.Multiprecision provides unified cross-platform C++ interface for __float128 and _Quad types, and includes a custom implementation of the standard math library. * The Multiprecision Computing Toolbox for MATLAB allows quadruple-precision computations in MATLAB. It includes basic arithmetic functionality as well as numerical methods, dense and sparse linear algebra. * The DoubleFloats package provides support for double-double computations for the Julia programming language. * The doubledouble.py library enables double-double computations in Python. * Mathematica supports IEEE quad-precision numbers: 128-bit floating-point values (Real128), and 256-bit complex values (Complex256).

Hardware support

IEEE quadruple precision was added to the IBM System/390 G5 in 1998, and is supported in hardware in subsequent z/Architecture processors. The IBM POWER9 CPU ( Power ISA 3.0) has native 128-bit hardware support. Native support of IEEE 128-bit floats is defined in

PA-RISC Precision Architecture reduced instruction set computer, RISC (PA-RISC) or Hewlett Packard Precision Architecture (HP/PA or simply HPPA), is a computer, general purpose computer instruction set architecture (ISA) developed by Hewlett-Packard f ...

1.0, and in SPARC V8 and V9 architectures (e.g. there are 16 quad-precision registers %q0, %q4, ...), but no SPARC CPU implements quad-precision operations in hardware . Non-IEEE extended-precision (128 bits of storage, 1 sign bit, 7 exponent bits, 112 fraction bits, 8 bits unused) was added to the

IBM System/370 The IBM System/370 (S/370) is a range of IBM mainframe computers announced as the successors to the IBM System/360, System/360 family on June 30, 1970. The series mostly maintains backward compatibility with the S/360, allowing an easy migrati ...

series (1970s–1980s) and was available on some

System/360 The IBM System/360 (S/360) is a family of mainframe computer systems announced by IBM on April 7, 1964, and delivered between 1965 and 1978. System/360 was the first family of computers designed to cover both commercial and scientific applicati ...

models in the 1960s (System/360-85, -195, and others by special request or simulated by OS software). The Siemens 7.700 and 7.500 series mainframes and their successors support the same floating-point formats and instructions as the IBM System/360 and System/370. The VAX processor implemented non-IEEE quadruple-precision floating point as its "H Floating-point" format. It had one sign bit, a 15-bit exponent and 112-fraction bits, however the layout in memory was significantly different from IEEE quadruple precision and the exponent bias also differed. Only a few of the earliest VAX processors implemented H Floating-point instructions in hardware, all the others emulated H Floating-point in software. The NEC Vector Engine architecture supports adding, subtracting, multiplying and comparing 128-bit binary IEEE 754 quadruple-precision numbers. Two neighboring 64-bit registers are used. Quadruple-precision arithmetic is not supported in the vector register. The

RISC-V RISC-V (pronounced "risk-five") is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. The project commenced in 2010 at the University of California, Berkeley. It transfer ...

architecture specifies a "Q" (quad-precision) extension for 128-bit binary IEEE 754-2008 floating-point arithmetic. The "L" extension (not yet certified) will specify 64-bit and 128-bit decimal floating point.
Chapter 15, p. 95. Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement Single instruction, multiple data, SIMD instructions, such as

Streaming SIMD Extensions In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data ( SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in its Pentium III series of central processing units (CPU ...

or AltiVec, which refers to 128-bit vectors of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.

References

External links

High-Precision Software Directory

QPFloat
a

free software Free software, libre software, libreware sometimes known as freedom-respecting software is computer software distributed open-source license, under terms that allow users to run the software for any purpose as well as to study, change, distribut ...

( GPL) software library for quadruple-precision arithmetic
HPAlib
a free software (

LGPL The GNU Lesser General Public License (LGPL) is a free-software license published by the Free Software Foundation (FSF). The license allows developers and companies to use and integrate a software component released under the LGPL into their own ...

) software library for quad-precision arithmetic
libquadmath
the GCC quad-precision math library
IEEE-754 Analysis
interactive web page for examining binary32, binary64, and binary128 floating-point values {{data types Binary arithmetic Floating point types