Floating-point error mitigation is the minimization of errors caused by the fact that real numbers cannot, in general, be accurately represented in a fixed space. By definition, floating-point error cannot be eliminated, and, at best, can only be managed. Huberto M. Sierra noted in his 1956 patent "Floating Decimal Point Arithmetic Control Means for Calculator": The Z1, developed by

Konrad Zuse Konrad Ernst Otto Zuse (; ; 22 June 1910 – 18 December 1995) was a German civil engineer, List of pioneers in computer science, pioneering computer scientist, inventor and businessman. His greatest achievement was the world's first programm ...

in 1936, was the first computer with

floating-point arithmetic In computing, floating-point arithmetic (FP) is arithmetic on subsets of real numbers formed by a ''significand'' (a Sign (mathematics), signed sequence of a fixed number of digits in some Radix, base) multiplied by an integer power of that ba ...

and was thus susceptible to floating-point error. Early computers, however, with operation times measured in milliseconds, could not solve large, complex problems and thus were seldom plagued with floating-point error. Today, however, with

supercomputer A supercomputer is a type of computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instruc ...

system performance measured in

petaflops Floating point operations per second (FLOPS, flops or flop/s) is a measure of computer performance in computing, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate measu ...

, floating-point error is a major concern for computational problem solvers. The following sections describe the strengths and weaknesses of various means of mitigating floating-point error.

Numerical error analysis

Though not the primary focus of

numerical analysis Numerical analysis is the study of algorithms that use numerical approximation (as opposed to symbolic computation, symbolic manipulations) for the problems of mathematical analysis (as distinguished from discrete mathematics). It is the study of ...

, numerical error analysis exists for the analysis and minimization of floating-point rounding error.

Monte Carlo arithmetic

Error analysis by

Monte Carlo Monte Carlo ( ; ; or colloquially ; , ; ) is an official administrative area of Monaco, specifically the Ward (country subdivision), ward of Monte Carlo/Spélugues, where the Monte Carlo Casino is located. Informally, the name also refers to ...

arithmetic is accomplished by repeatedly injecting small errors into an algorithm's data values and determining the relative effect on the results.

Extension of precision

Extension of precision is using of larger representations of real values than the one initially considered. The

IEEE 754 The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic originally established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard #Design rationale, add ...

standard defines precision as the number of digits available to represent real numbers. A programming language can include

single precision Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. A floa ...

(32 bits),

double precision Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point arithmetic, floating-point computer number format, number format, usually occupying 64 Bit, bits in computer memory; it represents a wide range of numeri ...

(64 bits), and

quadruple precision In computing, quadruple precision (or quad precision) is a binary Floating-point arithmetic, floating-point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit Double-precision floating-point ...

(128 bits). While extension of precision makes the effects of error less likely or less important, the true accuracy of the results is still unknown.

Variable-length arithmetic

Variable length arithmetic represents numbers as a string of digits of a variable's length limited only by the memory available. Variable-length arithmetic operations are considerably slower than fixed-length format floating-point instructions. When high performance is not a requirement, but high precision is, variable length arithmetic can prove useful, though the actual accuracy of the result may not be known.

Use of the error term of a floating-point operation

The floating-point algorithm known as ''TwoSum'' or ''

2Sum 2Sum is a floating-point algorithm for computing the exact round-off error in a floating-point addition operation. 2Sum and its variant Fast2Sum were first published by Ole Møller in 1965. Fast2Sum is often used implicitly in other algorithms suc ...

'', due to Knuth and Møller, and its simpler, but restricted version ''FastTwoSum'' or ''Fast2Sum'' (3 operations instead of 6), allow one to get the (exact) error term of a floating-point addition rounded to nearest. One can also obtain the (exact) error term of a floating-point multiplication rounded to nearest in 2 operations with a

fused multiply–add Fuse or FUSE may refer to: Devices * Fuse (electrical) In electronics and electrical engineering, a fuse is an electrical safety device that operates to provide overcurrent protection of an electrical circuit. Its essential component is a me ...

(FMA), or 17 operations if the FMA is not available (with an algorithm due to Dekker). These error terms can be used in algorithms in order to improve the accuracy of the final result, e.g. with floating-point expansions or compensated algorithms. Operations giving the result of a floating-point addition or multiplication rounded to nearest with its error term (but slightly differing from algorithms mentioned above) have been standardized and recommended in the IEEE 754-2019 standard.

Choice of a different radix

Changing the

radix In a positional numeral system, the radix (radices) or base is the number of unique digits, including the digit zero, used to represent numbers. For example, for the decimal system (the most common system in use today) the radix is ten, becaus ...

, in particular from binary to decimal, can help to reduce the error and better control the rounding in some applications, such as

financial Finance refers to monetary resources and to the study and Academic discipline, discipline of money, currency, assets and Liability (financial accounting), liabilities. As a subject of study, is a field of Business administration, Business Admin ...

applications.

Interval arithmetic

Interval arithmetic Interval arithmetic (also known as interval mathematics; interval analysis or interval computation) is a mathematical technique used to mitigate rounding and measurement errors in mathematical computation by computing function bounds. Numeri ...

is a mathematical technique used to put bounds on

rounding error In computing, a roundoff error, also called rounding error, is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic. Roun ...

s and

measurement error Observational error (or measurement error) is the difference between a measured value of a quantity and its unknown true value.Dodge, Y. (2003) ''The Oxford Dictionary of Statistical Terms'', OUP. Such errors are inherent in the measurement pr ...

s in mathematical computation. Values are intervals, which can be represented in various ways, such as: * inf-sup: a lower bound and an upper bound on the true value; * mid-rad: an approximation and an error bound (called ''midpoint'' and ''radius'' of the interval); * triplex: an approximation, a lower bound and an upper bound on the error.

"Instead of using a single floating-point number as approximation for the value of a real variable in the mathematical model under investigation, interval arithmetic acknowledges limited precision by associating with the variable a set of reals as possible values. For ease of storage and computation, these sets are restricted to intervals."

The evaluation of interval arithmetic expression may provide a large range of values, and may seriously overestimate the true error boundaries.

Gustafson's unums

Unums ("Universal Numbers") are an extension of variable length arithmetic proposed by John Gustafson.
/ref> Unums have variable length fields for the exponent and

significand The significand (also coefficient, sometimes argument, or more ambiguously mantissa, fraction, or characteristic) is the first (left) part of a number in scientific notation or related concepts in floating-point representation, consisting of its s ...

lengths and error information is carried in a single bit, the ubit, representing possible error in the least significant bit of the significand ( ULP). The efficacy of unums is questioned by

William Kahan William "Velvel" Morton Kahan (born June 5, 1933) is a Canadian mathematician and computer scientist, who is a professor emeritus at University of California, Berkeley. He received the Turing Award in 1989 for "his fundamental contributions to nu ...

Bounded floating point

Bounded floating point is a method proposed and patented by Alan Jorgensen. The data structure includes the standard

data structure and interpretation, as well as information about the error between the true real value represented and the value stored by the floating point representation. Bounded floating point has been criticized as being derivative of Gustafson's work on unums and interval arithmetic.

References

{{reflist Floating point Computer arithmetic Error