Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended-precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to ''extended precision'', arbitrary-precision arithmetic refers to implementations of much larger numeric types (with a storage count that usually is not a power of two) using special software (or, rarely, hardware).

Extended-precision implementations

There is a long history of extended floating-point formats reaching back nearly to the middle of the last century. Various manufacturers have used different formats for extended precision for different machines. In many cases the format of the extended precision is not quite the same as a scale-up of the ordinary single- and double-precision formats it is meant to extend. In a few cases the implementation was merely a software-based change in the floating-point data format, but in most cases extended precision was implemented in hardware, either built into the central processor itself, or more often, built into the hardware of an optional, attached processor called a "floating-point unit" (FPU) or "floating-point processor" (FPP), accessible to the CPU as a fast input/output device.

IBM extended-precision formats

The IBM 1130, sold in 1965, offered two floating-point formats: a 32-bit "standard precision" format and a 40-bit "extended precision" format. The standard-precision format contains a 24-bit two's complement significand while the extended-precision format utilizes a 32-bit two's complement significand. The latter format makes full use of the CPU's 32-bit integer operations. The characteristic in both formats is an 8-bit field containing the power of two biased by 128. Floating-point arithmetic operations are performed by software, and double precision is not supported at all. The extended format occupies three 16-bit words, with the extra space simply ignored.

The IBM System/360 supports a 32-bit "short" floating-point format and a 64-bit "long" floating-point format. The 360/85 and follow-on System/370 add support for a 128-bit "extended" format. These formats are still supported in the current design, where they are now called the "hexadecimal floating-point" (HFP) formats.

Microsoft MBF extended-precision format

The Microsoft BASIC port for the 6502 CPU, such as in adaptations like Commodore BASIC, AppleSoft BASIC, KIM-1 BASIC or MicroTAN BASIC, has supported an extended 40-bit variant of the floating-point format ''Microsoft Binary Format'' (MBF) since 1977.

IEEE 754 extended-precision formats

The IEEE 754 floating-point standard recommends that implementations provide extended-precision formats. The standard specifies the minimum requirements for an extended format but does not specify an encoding. The encoding is the implementor's choice.

The IA-32, x86-64, and Itanium processors support what is by far the most influential format under this standard, the Intel 80-bit (64-bit significand) "double extended" format, described in the next section.

The Motorola 6888x math coprocessors and the Motorola 68040 and 68060 processors also support a 64-bit significand extended-precision format (similar to the Intel format, although padded to a 96-bit format with 16 unused bits inserted between the exponent and significand fields, and values with exponent zero and bit 63 one are normalized values). The follow-on Coldfire processors do not support this 96-bit extended-precision format.

The FPA10 math coprocessor for early ARM processors also supports a 64-bit significand extended-precision format (similar to the Intel format although padded to a 96-bit format with 16 zero bits inserted between the sign and the exponent fields), but without correct rounding.

The x87 and Motorola 68881 80-bit formats meet the requirements of the IEEE 754-1985 double extended format, as does the IEEE 754 128-bit binary format.

x86 extended-precision format

The x86 extended-precision format is an 80-bit format first implemented in the Intel 8087 math coprocessor and is supported by all processors that are based on the x86 design that incorporate a floating-point unit (FPU).

The Intel 8087 was the first x86 device which supported floating-point arithmetic in hardware. It was designed to support a 32-bit "single-precision" format and a 64-bit "double-precision" format for encoding and interchanging floating-point numbers. The extended format was designed not to store data at higher precision, but rather to allow for the computation of temporary double results more reliably and accurately by minimising overflow and roundoff errors in intermediate calculations. All the floating-point registers in the 8087 hold this format, and it automatically converts numbers to this format when loading registers from memory and also converts results back to the more conventional formats when storing the registers back into memory. To enable intermediate subexpression results to be saved in extended-precision scratch variables and continued across programming language statements, and otherwise interrupted calculations to resume where they were interrupted, it provides instructions which transfer values between these internal registers and memory without performing any conversion. This enables access to the extended format for calculations, also reviving the issue of the accuracy of functions of such numbers, but at a higher precision.

The floating-point units (FPU) on all subsequent x86 processors have supported this format. As a result, software can be developed which takes advantage of the higher precision provided by this format. William Kahan, a primary designer of the x87 arithmetic and initial IEEE 754 standard proposal, notes on the development of the x87 floating point: "An extended format as wide as we dared (80 bits) was included to serve the same support role as the 13 decimal internal format serves in Hewlett-Packard's 10 decimal calculators." Moreover, Kahan notes that 64 bits was the widest significand across which carry propagation could be done without increasing the cycle time on the 8087, and that the x87 extended precision was designed to be extensible to higher precision in future processors:

"For now the 10 byte extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16 byte format. ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."

This 80-bit format uses one bit for the sign of the significand, 15 bits for the exponent field (i.e. the same range as the 128-bit quadruple precision IEEE 754 format) and 64 bits for the significand. The exponent field is biased by 16383, meaning that 16383 has to be subtracted from the value in the exponent field to compute the actual power of 2. An exponent field value of 32767 (all fifteen bits 1) is reserved so as to enable the representation of special states such as infinity and Not a Number. If the exponent field is zero, the value is a subnormal number and the exponent of 2 is −16382.

In the description of this format, "s" is the value of the sign bit (0 means positive, 1 means negative), "e" is the value of the exponent field interpreted as a positive integer, and "m" is the significand interpreted as a positive binary number, where the binary point is located between bits 63 and 62. The "m" field is the combination of the integer and fraction parts. For a normalized number the value is (−1)^s × 2^(e − 16383) × m.

In contrast to the single- and double-precision formats, this format does not utilize an implicit / hidden bit. Rather, bit 63 contains the integer part of the significand and bits 62–0 hold the fractional part. Bit 63 will be 1 on all normalized numbers. There were several advantages to this design when the 8087 was being developed:
* Calculations can be completed a little faster if all bits of the significand are present in the register.
* A 64-bit significand provides sufficient precision to avoid loss of precision when the results are converted back to double-precision format in the vast majority of cases.
* This format provides a mechanism for indicating precision loss due to underflow which can be carried through further operations. For example, a chain of multiplications can generate an intermediate result that is a subnormal and therefore involves precision loss, even though the product of all of the terms can be represented as a normalized number. The 80287 could complete such a calculation and indicate the loss of precision by returning an "unnormal" result (exponent not 0, bit 63 = 0). Processors since the 80387 no longer generate unnormals and do not support unnormal inputs to operations. They will generate a subnormal if an underflow occurs but will generate a normalized result if subsequent operations on the subnormal can be normalized.

Examples

These examples are given in bit ''representation'', in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.

0000 0000 0000 0000 0001₁₆ = 2⁻¹⁶³⁸² × 2⁻⁶³ = 2⁻¹⁶⁴⁴⁵ ≈ 3.64519953188247460252841 × 10⁻⁴⁹⁵¹ (smallest positive subnormal number)
0000 7fff ffff ffff ffff₁₆ = 2⁻¹⁶³⁸² × (1 − 2⁻⁶³) ≈ 3.36210314311209350589816 × 10⁻⁴⁹³² (largest subnormal number)
0001 8000 0000 0000 0000₁₆ = 2⁻¹⁶³⁸² ≈ 3.36210314311209350626268 × 10⁻⁴⁹³² (smallest positive normal number)
7ffe ffff ffff ffff ffff₁₆ = 2¹⁶³⁸⁴ × (1 − 2⁻⁶⁴) ≈ 1.18973149535723176502126 × 10⁴⁹³² (largest normal number)
3ffe ffff ffff ffff ffff₁₆ = 1 − 2⁻⁶⁴ ≈ 0.99999999999999999994579 (largest number less than one)
3fff 8000 0000 0000 0000₁₆ = 1 (one)
3fff 8000 0000 0000 0001₁₆ = 1 + 2⁻⁶³ ≈ 1.00000000000000000010842 (smallest number larger than one)
4000 8000 0000 0000 0000₁₆ = 2
c000 8000 0000 0000 0000₁₆ = −2
0000 0000 0000 0000 0000₁₆ = 0
8000 0000 0000 0000 0000₁₆ = −0
3ffd aaaa aaaa aaaa aaab₁₆ ≈ 0.33333333333333333334237 (closest approximation to 1/3)
4000 c90f daa2 2168 c235₁₆ ≈ 3.14159265358979323851281 (closest approximation to π)
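The bit patterns above can be checked with a short decoder. The following is a minimal sketch in Python, not a reference implementation: the function name `decode_x87` and the use of exact rational arithmetic are choices made for this illustration, and infinities and NaNs are rejected rather than modelled.

```python
from fractions import Fraction

EXPONENT_BIAS = 16383  # bias of the 15-bit exponent field


def decode_x87(hex_string):
    """Decode an 80-bit extended-precision bit pattern to an exact Fraction.

    hex_string holds 20 hexadecimal digits (spaces allowed): sign and
    exponent first, then the 64-bit significand. Hypothetical helper written
    for this illustration only.
    """
    raw = int(hex_string.replace(" ", ""), 16)
    sign = raw >> 79                       # bit 79
    exponent = (raw >> 64) & 0x7FFF        # bits 78..64
    significand = raw & ((1 << 64) - 1)    # bits 63..0, explicit integer bit
    if exponent == 0x7FFF:
        raise ValueError("exponent field 32767 encodes infinity or NaN")
    # An exponent field of zero marks a subnormal: the power of 2 is -16382.
    power = (exponent if exponent != 0 else 1) - EXPONENT_BIAS
    # The binary point sits between bits 63 and 62, hence the 2**63 divisor.
    value = Fraction(significand, 1 << 63) * Fraction(2) ** power
    return -value if sign else value


if __name__ == "__main__":
    print(decode_x87("4000 8000 0000 0000 0000"))         # the value two
    print(float(decode_x87("3ffd aaaa aaaa aaaa aaab")))   # close to 1/3
```

Because the result is a `Fraction`, the sign of a negative zero is not preserved; a fuller decoder would return the sign separately.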

Introduction to use

The 80-bit floating-point format was widely available by 1984, after the development of C, Fortran and similar computer languages, which initially offered only the common 32- and 64-bit floating-point sizes. On the x86 design most C compilers now support 80-bit extended precision via the long double type, and this was specified in the C99 / C11 standards (IEC 60559 floating-point arithmetic (Annex F)). Compilers on x86 for other languages often support extended precision as well, sometimes via nonstandard extensions: for example, Turbo Pascal offers an extended type, and several Fortran compilers have a 10-byte real type (analogous to their 4-byte and 8-byte real types). Such compilers also typically include extended-precision mathematical subroutines, such as square root and trigonometric functions, in their standard libraries.

Working range

The 80-bit floating-point format has a range (including subnormals) from approximately 3.65 × 10⁻⁴⁹⁵¹ to 1.19 × 10⁴⁹³². Although log₁₀(2⁶⁴) ≈ 19.266, this format is usually described as giving approximately eighteen significant digits of precision (the floor of log₁₀(2⁶³), the minimum guaranteed precision). The use of decimal when talking about binary is unfortunate because most decimal fractions are recurring sequences in binary just as 2/3 is in decimal. Thus, a value such as 10.15 is represented in binary as equivalent to 10.1499996185 etc. in decimal for the 32-bit format but 10.15000000000000035527 etc. in the 64-bit format: inter-conversion will involve approximation, except for those few decimal fractions that represent an exact binary value, such as 0.625. For the 80-bit format, the decimal string is 10.1499999999999999996530553 etc. The last 9 digit is the eighteenth fractional digit and thus the twentieth significant digit of the string.

Bounds on conversion between decimal and binary for the 80-bit format can be given as follows: if a decimal string with at most 18 significant digits is correctly rounded to an 80-bit IEEE 754 binary floating-point value (as on input) then converted back to the same number of significant decimal digits (as for output), then the final string will exactly match the original; while, conversely, if an 80-bit IEEE 754 binary floating-point value is correctly converted and (nearest) rounded to a decimal string with at least 21 significant decimal digits then converted back to binary format it will exactly match the original.

These approximations are particularly troublesome when specifying the best value for constants in formulae to high precision, as might be calculated via arbitrary-precision arithmetic.
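The decimal expansions of 10.15 quoted above can be reproduced with exact rational arithmetic. The sketch below is illustrative only: `round_to_precision` and `to_decimal_string` are hypothetical helper names, rounding is to nearest with ties to even, and the exponent range of each format is ignored (no overflow or subnormal handling).

```python
from fractions import Fraction


def round_to_precision(value, bits):
    """Round a positive Fraction to the nearest value with a `bits`-bit significand."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    # Find the exponent with 2**exponent <= value < 2**(exponent + 1).
    exponent = value.numerator.bit_length() - value.denominator.bit_length()
    if Fraction(2) ** exponent > value:
        exponent -= 1
    # Scale so the significand becomes an integer of `bits` bits, then round.
    scale = Fraction(2) ** (bits - 1 - exponent)
    return round(value * scale) / scale  # round() on a Fraction is ties-to-even


def to_decimal_string(value, places):
    """Format a positive Fraction with `places` fractional digits, truncated."""
    scaled = int(value * 10**places)
    whole, fraction = divmod(scaled, 10**places)
    return f"{whole}.{fraction:0{places}d}"


if __name__ == "__main__":
    x = Fraction("10.15")
    for bits, places in ((24, 10), (53, 20), (64, 25)):
        print(bits, to_decimal_string(round_to_precision(x, bits), places))
```

The significand widths 24, 53 and 64 correspond to the 32-bit, 64-bit and 80-bit formats respectively.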

Need for the 80-bit format

A notable example of the need for a minimum of 64 bits of precision in the significand of the extended-precision format is the need to avoid precision loss when performing exponentiation on double-precision values. The x86 floating-point units do not provide an instruction that directly performs exponentiation. Instead they provide a set of instructions that a program can use in sequence to perform exponentiation using the equation:

x^y = 2^{y \cdot \log_2(x)}
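The sequence of steps implied by this identity can be sketched in Python. This is an illustration of the identity only: `pow_via_log2` is a hypothetical name, and the intermediate result is held in an ordinary 64-bit double, so the sketch does not have the extra precision that the extended format is meant to supply.

```python
import math


def pow_via_log2(x, y):
    """Compute x**y as 2**(y * log2(x)) for positive x.

    The intermediate product is a plain double here, so a few low-order
    bits of the result may be lost for large exponents.
    """
    if x <= 0.0:
        raise ValueError("log2 is undefined for zero and negative numbers")
    intermediate = y * math.log2(x)   # the intermediate result
    return 2.0 ** intermediate        # final exponentiation with base 2


if __name__ == "__main__":
    print(pow_via_log2(3.0, 4.0), 3.0 ** 4.0)
```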

In order to avoid precision loss, the intermediate results "log₂(x)" and "y · log₂(x)" must be computed with much higher precision, because effectively both the exponent and the significand fields of x must fit into the significand field of the intermediate result. Subsequently, the significand field of the intermediate result is split between the exponent and significand fields of the final result when 2^(intermediate result) is calculated. The following discussion describes this requirement in more detail.

With a little unpacking, an IEEE 754 double-precision value can be represented as:

2^{(-1)^s \cdot E} \cdot M

where s is the sign of the exponent (either 0 or 1), E is the unbiased exponent, which is an integer that ranges from 0 to 1023, and M is the significand, which is a 53-bit value that falls in the range 1 ≤ M < 2. Negative numbers and zero can be ignored because the logarithm of these values is undefined. For purposes of this discussion M does not have 53 bits of precision because it is constrained to be greater than or equal to one, i.e. the hidden bit does not count towards the precision. (Note that in situations where M is less than 1, the value is actually a de-normal and therefore may have already suffered precision loss. This situation is beyond the scope of this article.)

Taking the log of this representation of a double-precision number and simplifying results in the following:

\log_2(2^{(-1)^s \cdot E} \cdot M) = (-1)^s \cdot E \cdot \log_2(2) + \log_2(M) = \pm E + \log_2(M)

This result demonstrates that when taking the base 2 logarithm of a number, the sign of the exponent of the original value becomes the sign of the logarithm, the exponent of the original value becomes the integer part of the significand of the logarithm, and the significand of the original value is transformed into the fractional part of the significand of the logarithm.

Because E is an integer in the range 0 to 1023, up to 10 bits to the left of the radix point are needed to represent the integer part of the logarithm. Because M falls in the range 1 ≤ M < 2, the value of log₂(M) will fall in the range 0 ≤ log₂(M) < 1, so at least 52 bits are needed to the right of the radix point to represent the fractional part of the logarithm. Combining 10 bits to the left of the radix point with 52 bits to the right of the radix point means that the significand part of the logarithm must be computed to at least 62 bits of precision. In practice values of M less than √2 require 53 bits to the right of the radix point and values of M less than ∜2 require 54 bits to the right of the radix point to avoid precision loss. Balancing this requirement for added precision to the right of the radix point, exponents less than 512 only require 9 bits to the left of the radix point and exponents less than 256 require only 8 bits to the left of the radix point.

The final part of the exponentiation calculation is computing 2^(intermediate result). The "intermediate result" consists of an integer part "I" added to a fractional part "F". If the intermediate result is negative then a slight adjustment is needed to get a positive fractional part because both "I" and "F" are negative numbers.

For positive intermediate results:

2^{\mathsf{intermediate\ result}} = 2^{I + F} = 2^I\ 2^F

For negative intermediate results:

2^{\mathsf{intermediate\ result}} = 2^{I + F} = 2^{I + (1 - 1) + F} = 2^{(I - 1) + (F + 1)} = 2^{I - 1}\ 2^{F + 1}

Thus the integer part of the intermediate result ("I" or "I − 1") plus a bias becomes the exponent of the final result, and the transformed positive fractional part of the intermediate result, 2^F or 2^(F + 1), becomes the significand of the final result. In order to supply 52 bits of precision to the final result, the positive fractional part must be maintained to at least 52 bits.

In conclusion, the exact number of bits of precision needed in the significand of the intermediate result is somewhat data dependent but 64 bits is sufficient to avoid precision loss in the vast majority of exponentiation computations involving double-precision numbers.

The number of bits needed for the exponent of the extended-precision format follows from the requirement that the product of two double-precision numbers should not overflow when computed using the extended format. The largest possible exponent of a double-precision value is 1023 so the exponent of the largest possible product of two double-precision numbers is 2047 (an 11-bit value). Adding in a bias to account for negative exponents means that the exponent field must be at least 12 bits wide.

Combining these requirements: 1 bit for the sign, 12 bits for the biased exponent, and 64 bits for the significand means that the extended-precision format would need at least 77 bits. Engineering considerations resulted in the final definition of the 80-bit format (in particular the IEEE 754 standard requires the exponent range of an extended-precision format to match that of the next largest, quad, precision format which is 15 bits).

Other calculations that benefit from extended-precision arithmetic are iterative refinement schemes, used to indirectly clean out errors accumulated in the direct solution during the typically very large number of calculations made for numerical linear algebra.
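The two decompositions used in the exponentiation discussion of this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the names `unpack`, `log2_from_fields` and `pow2_split` are hypothetical, all arithmetic is done in ordinary doubles, and `math.frexp` normalizes subnormal inputs, a case the discussion sets aside.

```python
import math


def unpack(x):
    """Return (s, E, M) such that x == 2**((-1)**s * E) * M with 1 <= M < 2."""
    if not (x > 0.0 and math.isfinite(x)):
        raise ValueError("only positive finite values have a real logarithm")
    m, e = math.frexp(x)        # x == m * 2**e with 0.5 <= m < 1
    exponent = e - 1            # rescale so the significand lies in [1, 2)
    sign = 1 if exponent < 0 else 0
    return sign, abs(exponent), m * 2.0


def log2_from_fields(x):
    """Assemble log2(x) as (+/-)E + log2(M): integer part plus fractional part."""
    s, E, M = unpack(x)
    return (-1) ** s * E + math.log2(M)


def pow2_split(r):
    """Compute 2**r from an integer part and a fractional part in [0, 1)."""
    integer_part = math.floor(r)   # for a negative non-integer r this is I - 1
    fraction = r - integer_part    # and this is F + 1, so 0 <= fraction < 1
    # 2**fraction lies in [1, 2) and plays the role of the significand;
    # ldexp applies the integer part as the binary exponent.
    return math.ldexp(2.0 ** fraction, integer_part)


if __name__ == "__main__":
    print(unpack(1000.0), log2_from_fields(1000.0), pow2_split(-2.5))
```

Using `math.floor` covers both the positive and the negative case with one expression, since flooring a negative non-integer yields I − 1 directly.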

Language support

* Some C / C++ implementations (e.g., GNU Compiler Collection (GCC), Clang, Intel C++) implement long double using 80-bit floating-point numbers on x86 systems. However, this is implementation-defined behavior: it is allowed but not required by the standard, as specified for IEEE 754 hardware in the C99 standard "Annex F IEC 60559 floating-point arithmetic". GCC also provides __float80 and __float128 types.
* Some Common Lisp implementations (e.g. CMU Common Lisp, Embeddable Common Lisp) implement long-float using 80-bit floating-point numbers on x86 systems.
* The D programming language implements real using the largest floating-point size implemented in hardware, for example 80 bits for x86 CPUs. On other machines, this will be the widest floating-point type natively supported by the CPU, or 64-bit double precision, whichever is wider.
* Turbo Pascal (and Object Pascal, Delphi) has an extended 80-bit type available in addition to real / single (32 bits) and double (64 bits), either natively (when an 80x87 coprocessor is present) or emulated (through the Turbo87 library); this extended type is available on 16-, 32-, and 64-bit platforms, possibly with padding.
* The Racket run-time system provides the 80-bit extflonum datatype on x86 systems.
* The Swift standard library provides the Float80 datatype.
* The PowerBASIC BASIC compiler provides the EXT or EXTENDED 10-byte extended-precision floating-point data type.
* Zig provides an f80 type since version 0.10.0.
