In
computing
Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, ...
, floating-point arithmetic (FP) is
arithmetic that represents
real number
In mathematics, a real number is a number that can be used to measure a ''continuous'' one-dimensional quantity such as a distance, duration or temperature. Here, ''continuous'' means that values can have arbitrarily small variations. Every ...
s approximately, using an
integer
An integer is the number zero (), a positive natural number (, , , etc.) or a negative integer with a minus sign ( −1, −2, −3, etc.). The negative numbers are the additive inverses of the corresponding positive numbers. In the languag ...
with a fixed precision, called the
significand, scaled by an integer
exponent of a fixed base. For example, 12.345 can be represented as a base-ten floating-point number:
In practice, most floating-point systems use base two, though base ten (
decimal floating point
Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when convert ...
) is also common.
The term ''floating point'' refers to the fact that the number's
radix point
A decimal separator is a symbol used to separate the integer part from the fractional part of a number written in decimal form (e.g., "." in 12.45). Different countries officially designate different symbols for use as the separator. The choi ...
can "float" anywhere to the left, right, or between the significant digits of the number. This position is indicated by the exponent, so floating point can be considered a form of
scientific notation
Scientific notation is a way of expressing numbers that are too large or too small (usually would result in a long string of digits) to be conveniently written in decimal form. It may be referred to as scientific form or standard index form, o ...
.
A floating-point system can be used to represent, with a fixed number of digits, numbers of very different
orders of magnitude
An order of magnitude is an approximation of the logarithm of a value relative to some contextually understood reference value, usually 10, interpreted as the base of the logarithm and the representative of values of magnitude one. Logarithmic dis ...
— such as the number of meters
between galaxies or
between protons in an atom. For this reason, floating-point arithmetic is often used to allow very small and very large real numbers that require fast processing times. The result of this
dynamic range
Dynamic range (abbreviated DR, DNR, or DYR) is the ratio between the largest and smallest values that a certain quantity can assume. It is often used in the context of Signal (electrical engineering), signals, like sound and light. It is measured ...
is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers varies with their exponent.
Over the years, a variety of floating-point representations have been used in computers. In 1985, the
IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.
The speed of floating-point operations, commonly measured in terms of
FLOPS, is an important characteristic of a
computer system
A computer is a machine that can be programmed to carry out sequences of arithmetic or logical operations ( computation) automatically. Modern digital electronic computers can perform generic sets of operations known as programs. These prog ...
, especially for applications that involve intensive mathematical calculations.
A
floating-point unit (FPU, colloquially a math
coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers.
Overview
Floating-point numbers
A
number representation specifies some way of encoding a number, usually as a string of digits.
There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the
radix point
A decimal separator is a symbol used to separate the integer part from the fractional part of a number written in decimal form (e.g., "." in 12.45). Different countries officially designate different symbols for use as the separator. The choi ...
is indicated by placing an explicit
"point" character (dot or comma) there. If the radix point is not specified, then the string implicitly represents an
integer
An integer is the number zero (), a positive natural number (, , , etc.) or a negative integer with a minus sign ( −1, −2, −3, etc.). The negative numbers are the additive inverses of the corresponding positive numbers. In the languag ...
and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In
fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might be to use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.
In
scientific notation
Scientific notation is a way of expressing numbers that are too large or too small (usually would result in a long string of digits) to be conveniently written in decimal form. It may be referred to as scientific form or standard index form, o ...
, the given number is scaled by a
power of 10
A power of 10 is any of the integer powers of the number ten; in other words, ten multiplied by itself a certain number of times (when the power is a positive integer). By definition, the number one is a power (the zeroth power) of ten. The f ...
, so that it lies within a certain range—typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the orbital period of
Jupiter
Jupiter is the fifth planet from the Sun and the largest in the Solar System. It is a gas giant with a mass more than two and a half times that of all the other planets in the Solar System combined, but slightly less than one-thousandth t ...
's moon
Io is seconds, a value that would be represented in standard-form scientific notation as seconds.
Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:
* A signed (meaning positive or negative) digit string of a given length in a given
base (or
radix). This digit string is referred to as the ''
significand'', ''mantissa'', or ''coefficient''.
The length of the significand determines the ''precision'' to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit.
* A signed integer
exponent (also referred to as the ''characteristic'', or ''scale''),
which modifies the magnitude of the number.
To derive the value of the floating-point number, the ''significand'' is multiplied by the ''base'' raised to the power of the ''exponent'', equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.
Using base-10 (the familiar
decimal notation) as an example, the number , which has ten decimal digits of precision, is represented as the significand together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by to give , or . In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.
Symbolically, this final value is:
where is the significand (ignoring any implied decimal point), is the precision (the number of digits in the significand), is the base (in our example, this is the number ''ten''), and is the exponent.
Historically, several number bases have been used for representing floating-point numbers, with base two (
binary
Binary may refer to:
Science and technology Mathematics
* Binary number, a representation of numbers using only two digits (0 and 1)
* Binary function, a function that takes two arguments
* Binary operation, a mathematical operation that ta ...
) being the most common, followed by base ten (
decimal floating point
Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when convert ...
), and other less common varieties, such as base sixteen (
hexadecimal floating point
Hexadecimal floating point may refer to:
* IBM hexadecimal floating point in the IBM System 360 and 370 series of computers and others since 1964
* Hexadecimal floating-point arithmetic in the Illinois ILLIAC III computer in 1966
* Hexadecimal f ...
), base eight (octal floating point
), base four (quaternary floating point
), base three (
balanced ternary floating point) and even base 256
and base .
A floating-point number is a
rational number
In mathematics, a rational number is a number that can be expressed as the quotient or fraction of two integers, a numerator and a non-zero denominator . For example, is a rational number, as is every integer (e.g. ). The set of all rat ...
, because it can be represented as one integer divided by another; for example is (145/100)×1000 or /100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base (, or ). However, 1/3 cannot be represented exactly by either binary (0.010101...) or decimal (0.333...), but in
base 3
A ternary numeral system (also called base 3 or trinary) has three as its base. Analogous to a bit, a ternary digit is a trit (trinary digit). One trit is equivalent to log2 3 (about 1.58496) bits of information.
Although ''ternary'' m ...
, it is trivial (0.1 or 1×3
−1) . The occasions on which infinite expansions occur
depend on the base and its prime factors.
The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation,
, and so the significand is a string of 24
bit
The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
s. For instance, the number
π's first 33 bits are:
In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit). The 24-bit significand will stop at position 23, shown as the underlined bit above. The next bit, at position 24, is called the ''round bit'' or ''rounding bit''. It is used to round the 33-bit approximation to the nearest 24-bit number (there are
specific rules for halfway values, which is not the case here). This bit, which is in this example, is added to the integer formed by the leftmost 24 bits, yielding:
When this is stored in memory using the IEEE 754 encoding, this becomes the
significand . The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left-to-right as follows:
where is the precision ( in this example), is the position of the bit of the significand from the left (starting at and finishing at here) and is the exponent ( in this example).
It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one). This process is called ''normalization''. For binary formats (which uses only the digits and ), this non-zero digit is necessarily . Therefore, it does not need to be represented in memory; allowing the format to have one more bit of precision. This rule is variously called the ''leading bit convention'', the ''implicit bit convention'', the ''hidden bit convention'',
or the ''assumed bit convention''.
Alternatives to floating-point numbers
The floating-point representation is by far the most common way of representing in computers an approximation to real numbers. However, there are alternatives:
*
Fixed-point representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
*
Logarithmic number system
A logarithmic number system (LNS) is an arithmetic system used for representing real numbers in computer and digital hardware, especially for digital signal processing.
Overview
In an LNS, a number, X, is represented by the logarithm, x, of it ...
s (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (''i.e.'', the graph of the logarithm function) is smooth (except at 0). Conversely to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The (
symmetric
Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...
)
level-index arithmetic
The level-index (LI) representation of numbers, and its algorithms for arithmetic operations, were introduced by Charles Clenshaw and Frank Olver in 1984.
The symmetric form of the LI system and its arithmetic operations were presented by Clensh ...
(LI and SLI) of Charles Clenshaw,
Frank Olver and Peter Turner is a scheme based on a
generalized logarithm representation.
*
Tapered floating-point representation
In computing, tapered floating point (TFP) is a format similar to floating point, but with variable-sized entries for the significand and exponent instead of the fixed-length entries found in normal floating-point formats. In addition to this, ...
, which does not appear to be used in practice.
* Some simple rational numbers (''e.g.'', 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (''e.g.'', 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform
rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "
bignum" arithmetic for the individual integers.
*
Interval arithmetic
Interval arithmetic (also known as interval mathematics, interval analysis, or interval computation) is a mathematical technique used to put bounds on rounding errors and measurement errors in mathematical computation. Numerical methods usin ...
allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
*
Computer algebra system
A computer algebra system (CAS) or symbolic algebra system (SAS) is any mathematical software with the ability to manipulate mathematical expressions in a way similar to the traditional manual computations of mathematicians and scientists. The d ...
s such as
Mathematica,
Maxima, and
Maple
''Acer'' () is a genus of trees and shrubs commonly known as maples. The genus is placed in the family Sapindaceae.Stevens, P. F. (2001 onwards). Angiosperm Phylogeny Website. Version 9, June 2008 nd more or less continuously updated since http ...
can often handle irrational numbers like
or
in a completely "formal" way, without dealing with a specific encoding of the significand. Such a program can evaluate expressions like "
" exactly, because it is programmed to process the underlying mathematics directly, instead of using approximate values for each intermediate calculation.
History
In 1914,
Leonardo Torres y Quevedo
Leonardo Torres y Quevedo (; 28 December 1852 – 18 December 1936) was a Spanish civil engineer and mathematician of the late nineteenth and early twentieth centuries. Torres was a pioneer in the development of the radio control and automated ...
proposed a form of floating point in the course of discussing his design for a special-purpose electromechanical calculator.
In 1938,
Konrad Zuse
Konrad Ernst Otto Zuse (; 22 June 1910 – 18 December 1995) was a German civil engineer, pioneering computer scientist, inventor and businessman. His greatest achievement was the world's first programmable computer; the functional program ...
of Berlin completed the
Z1, the first binary, programmable
mechanical computer
A mechanical computer is a computer built from mechanical components such as levers and gears rather than electronic components. The most common examples are adding machines and mechanical counters, which use the turning of gears to increment out ...
;
it uses a 24-bit binary floating-point number representation with a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit.
The more reliable
relay
A relay
Electromechanical relay schematic showing a control coil, four pairs of normally open and one pair of normally closed contacts
An automotive-style miniature relay with the dust cover taken off
A relay is an electrically operated switch ...
-based
Z3, completed in 1941, has representations for both positive and negative infinities; in particular, it implements defined operations with infinity, such as
, and it stops on undefined operations, such as
.
Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that includes
and NaN representations, anticipating features of the IEEE Standard by four decades.
In contrast,
von Neumann Von Neumann may refer to:
* John von Neumann (1903–1957), a Hungarian American mathematician
* Von Neumann family
* Von Neumann (surname), a German surname
* Von Neumann (crater), a lunar impact crater
See also
* Von Neumann algebra
* Von Ne ...
recommended against floating-point numbers for the 1951
IAS machine
The IAS machine was the first electronic computer built at the Institute for Advanced Study (IAS) in Princeton, New Jersey. It is sometimes called the von Neumann machine, since the paper describing its design was edited by John von Neumann, a ...
, arguing that fixed-point arithmetic is preferable.
The first ''commercial'' computer with floating-point hardware was Zuse's
Z4 computer, designed in 1942–1945. In 1946, Bell Laboratories introduced the Mark V, which implemented
decimal floating-point numbers.
The
Pilot ACE
The Pilot ACE (Automatic Computing Engine) was one of the first computers built in the United Kingdom. Built at the National Physical Laboratory (NPL) in the early 1950s, it was also one of the earliest general-purpose, stored-program computers ...
has binary floating-point arithmetic, and it became operational in 1950 at
National Physical Laboratory, UK
The National Physical Laboratory (NPL) is the national measurement standards laboratory of the United Kingdom. It is one of the most extensive government laboratories in the UK and has a prestigious reputation for its role in setting and mainta ...
. Thirty-three were later sold commercially as the
English Electric DEUCE
The DEUCE (''Digital Electronic Universal Computing Engine'') was one of the earliest British commercially available computers, built by English Electric from 1955. It was the production version of the Pilot ACE, itself a cut-down version of ...
. The arithmetic is actually implemented in software, but with a one megahertz clock rate, the speed of floating-point and fixed-point operations in this machine were initially faster than those of many competing computers.
The mass-produced
IBM 704 followed in 1954; it introduced the use of a
biased exponent. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "
scientific computation
Computational science, also known as scientific computing or scientific computation (SC), is a field in mathematics that uses advanced computing capabilities to understand and solve complex problems. It is an area of science that spans many disc ...
" (SC) capability (see also
Extensions for Scientific Computation (XSC)). It was not until the launch of the Intel i486 in 1989 that ''general-purpose'' personal computers had floating-point capability in hardware as a standard feature.
The
UNIVAC 1100/2200 series
The UNIVAC 1100/2200 series is a series of compatible 36-bit computer systems, beginning with the UNIVAC 1107 in 1962, initially made by Sperry Rand. The series continues to be supported today by Unisys Corporation as the ClearPath Dorado Ser ...
, introduced in 1962, supported two floating-point representations:
* ''Single precision'': 36 bits, organized as a 1-bit sign, an 8-bit exponent, and a 27-bit significand.
* ''Double precision'': 72 bits, organized as a 1-bit sign, an 11-bit exponent, and a 60-bit significand.
The
IBM 7094, also introduced in 1962, supported single-precision and double-precision representations, but with no relation to the UNIVAC's representations. Indeed, in 1964, IBM introduced
hexadecimal floating-point representations in its
System/360 mainframes; these same representations are still available for use in modern
z/Architecture
z/Architecture, initially and briefly called ESA Modal Extensions (ESAME), is IBM's 64-bit complex instruction set computer (CISC) instruction set architecture, implemented by its mainframe computers. IBM introduced its first z/Architecture ...
systems. In 1998, IBM implemented IEEE-compatible binary floating-point arithmetic in its mainframes; in 2005, IBM also added IEEE-compatible decimal floating-point arithmetic.
Initially, computers used many different representations for floating-point numbers. The lack of standardization at the mainframe level was an ongoing problem by the early 1970s for those writing and maintaining higher-level source code; these manufacturer floating-point standards differed in the word sizes, the representations, and the rounding behavior and general accuracy of operations. Floating-point compatibility across multiple computing systems was in desperate need of standardization by the early 1980s, leading to the creation of the
IEEE 754 standard once the 32-bit (or 64-bit)
word
A word is a basic element of language that carries an objective or practical meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of what a word is, there is no conse ...
had become commonplace. This standard was significantly based on a proposal from Intel, which was designing the
i8087 numerical coprocessor; Motorola, which was designing the
68000
The Motorola 68000 (sometimes shortened to Motorola 68k or m68k and usually pronounced "sixty-eight-thousand") is a 16/32-bit complex instruction set computer (CISC) microprocessor, introduced in 1979 by Motorola Semiconductor Products Secto ...
around the same time, gave significant input as well.
In 1989, mathematician and computer scientist
William Kahan was honored with the
Turing Award
The ACM A. M. Turing Award is an annual prize given by the Association for Computing Machinery (ACM) for contributions of lasting and major technical importance to computer science. It is generally recognized as the highest distinction in comput ...
for being the primary architect behind this proposal; he was aided by his student Jerome Coonen and a visiting professor,
Harold Stone.
Among the x86 innovations are these:
* A precisely specified floating-point representation at the bit-string level, so that all compliant computers interpret bit patterns the same way. This makes it possible to accurately and efficiently transfer floating-point numbers from one computer to another (after accounting for
endianness
In computing, endianness, also known as byte sex, is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the mos ...
).
* A precisely specified behavior for the arithmetic operations: A result is required to be produced as if infinitely precise arithmetic were used to yield a value that is then rounded according to specific rules. This means that a compliant computer program would always produce the same result when given a particular input, thus mitigating the almost mystical reputation that floating-point computation had developed for its hitherto seemingly non-deterministic behavior.
* The ability of
exceptional conditions (overflow,
divide by zero, etc.) to propagate through a computation in a benign manner and then be handled by the software in a controlled fashion.
Range of floating-point numbers
A floating-point number consists of two
fixed-point components, whose range depends exclusively on the number of bits or digits in their representation. Whereas components linearly depend on their range, the floating-point range linearly depends on the significand range and exponentially on the range of exponent component, which attaches outstandingly wider range to the number.
On a typical computer system, a ''
double-precision'' (64-bit) binary floating-point number has a coefficient of 53 bits (including 1 implied bit), an exponent of 11 bits, and 1 sign bit. Since 2
10 = 1024, the complete range of the positive normal floating-point numbers in this format is from 2
−1022 ≈ 2 × 10
−308 to approximately 2
1024 ≈ 2 × 10
308.
The number of normal floating-point numbers in a system (''B'', ''P'', ''L'', ''U'') where
* ''B'' is the base of the system,
* ''P'' is the precision of the significand (in base ''B''),
* ''L'' is the smallest exponent of the system,
* ''U'' is the largest exponent of the system,
is
.
There is a smallest positive normal floating-point number,
: Underflow level = UFL =
,
which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.
There is a largest floating-point number,
: Overflow level = OFL =
,
which has ''B'' − 1 as the value for each digit of the significand and the largest possible value for the exponent.
In addition, there are representable values strictly between −UFL and UFL. Namely,
positive and negative zeros, as well as
subnormal number
In computer science, subnormal numbers are the subset of denormalized numbers (sometimes called denormals) that fill the underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than the smallest normal ...
s.
IEEE 754: floating point in modern computers
The
IEEE
The Institute of Electrical and Electronics Engineers (IEEE) is a 501(c)(3) professional association for electronic engineering and electrical engineering (and associated disciplines) with its corporate office in New York City and its operat ...
standardized the computer representation for binary floating-point numbers in
IEEE 754 (a.k.a. IEC 60559) in 1985. This first standard is followed by almost all modern machines. It was
revised in 2008. IBM mainframes support
IBM's own hexadecimal floating point format and IEEE 754-2008
decimal floating point
Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when convert ...
in addition to the IEEE 754 binary format. The
Cray T90
The Cray T90 series (code-named ''Triton'' during development) was the last of a line of vector processing supercomputers manufactured by Cray Research, Inc, superseding the Cray C90 series. The first machines were shipped in 1995, and featured ...
series had an IEEE version, but the
SV1 still uses Cray floating-point format.
The standard provides for many closely related formats, differing in only a few details. Five of these formats are called ''basic formats'', and others are termed ''extended precision formats'' and ''extendable precision format''. Three formats are especially widely used in computer hardware and languages:
*
Single precision
Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
A floatin ...
(binary32), usually used to represent the "float"
type in the C language family. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
*
Double precision
Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
Flo ...
(binary64), usually used to represent the "double"
type in the C language family. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
*
Double extended, also ambiguously called "extended precision" format. This is a binary format that occupies at least 79 bits (80 if the hidden/implicit bit rule is not used) and its significand has a precision of at least 64 bits (about 19 decimal digits). The
C99
C99 (previously known as C9X) is an informal name for ISO/IEC 9899:1999, a past version of the C programming language standard. It extends the previous version ( C90) with new features for the language and the standard library, and helps impl ...
and
C11 C11, C.XI, C-11 or C.11 may refer to:
Transport
* C-11 Fleetster, a 1920s American light transport aircraft for use of the United States Assistant Secretary of War
* Fokker C.XI, a 1935 Dutch reconnaissance seaplane
* LET C-11, a license-build var ...
standards of the C language family, in their annex F ("IEC 60559 floating-point arithmetic"), recommend such an extended format to be provided as "
long double
In C and related programming languages, long double refers to a floating-point data type that is often more precise than double precision though the language standard only requires it to be at least as precise as double. As with C's other flo ...
".
A format satisfying the minimal requirements (64-bit significand precision, 15-bit exponent, thus fitting on 80 bits) is provided by the
x86
x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was intr ...
architecture. Often on such processors, this format can be used with "long double", though extended precision is not available with MSVC. For
alignment purposes, many tools store this 80-bit value in a 96-bit or 128-bit space.
On other processors, "long double" may stand for a larger format, such as quadruple precision,
or just double precision, if any form of extended precision is not available.
Increasing the precision of the floating-point representation generally reduces the amount of accumulated
round-off error caused by intermediate calculations.
Less common IEEE formats include:
*
Quadruple precision
In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision.
This 128-bit quadruple precision is desi ...
(binary128). This is a binary format that occupies 128 bits (16 bytes) and its significand has a precision of 113 bits (about 34 decimal digits).
*
Decimal64
In computing, decimal64 is a decimal floating-point computer numbering format that occupies 8 bytes (64 bits) in computer memory.
It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and ta ...
and
decimal128
In computing, decimal128 is a decimal floating-point computer numbering format that occupies 16 bytes (128 bits) in computer memory. It is intended for applications where it is necessary to emulate decimal rounding exactly, such as fi ...
floating-point formats. These formats, along with the
decimal32
In computing, decimal32 is a decimal floating-point computer numbering format that occupies 4 bytes (32 bits) in computer memory.
It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financ ...
format, are intended for performing decimal rounding correctly.
*
Half precision
In computing, half precision (sometimes called FP16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications w ...
, also called binary16, a 16-bit floating-point value. It is being used in the NVIDIA
Cg graphics language, and in the openEXR standard.
Any integer with absolute value less than 2
24 can be exactly represented in the single-precision format, and any integer with absolute value less than 2
53 can be exactly represented in the double-precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double-precision floats but only 32-bit integers.
The standard specifies some special values, and their representation: positive
infinity (), negative infinity (), a
negative zero
Signed zero is zero with an associated sign. In ordinary arithmetic, the number 0 does not have a sign, so that −0, +0 and 0 are identical. However, in computing, some number representations allow for the existence of two zeros, often denoted by ...
(−0) distinct from ordinary ("positive") zero, and "not a number" values (
NaN
Nan or NAN may refer to:
Places China
* Nan County, Yiyang, Hunan, China
* Nan Commandery, historical commandery in Hubei, China
Thailand
* Nan Province
** Nan, Thailand, the administrative capital of Nan Province
* Nan River
People Given name
...
s).
Comparison of floating-point numbers, as defined by the IEEE standard, is a bit different from usual integer comparison. Negative and positive zero compare equal, and every NaN compares unequal to every value, including itself. All finite floating-point numbers are strictly smaller than and strictly greater than , and they are ordered in the same way as their values (in the set of real numbers).
Internal representation
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand or mantissa, from left to right. For the IEEE 754 binary formats (basic and extended) which have extant hardware implementations, they are apportioned as follows:
While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and
subnormal numbers
In computer science, subnormal numbers are the subset of denormalized numbers (sometimes called denormals) that fill the arithmetic underflow, underflow gap around zero in floating-point arithmetic. Any non-zero number with magnitude smaller than ...
; values of all 1s are reserved for the infinities and NaNs. The exponent range for normal numbers is
126, 127for single precision,
1022, 1023for double, or
16382, 16383for quad. Normal numbers exclude subnormal values, zeros, infinities, and NaNs.
In the IEEE binary interchange formats the leading 1 bit of a normalized significand is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, the single-precision format actually has a significand with 24 bits of precision, the double-precision format has 53, and quad has 113.
For example, it was shown above that π, rounded to 24 bits of precision, has:
* sign = 0 ; ''e'' = 1 ; ''s'' = 110010010000111111011011 (including the hidden bit)
The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in the single-precision format as
* 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB
as a
hexadecimal number.
An example of a layout for
32-bit floating point is
and the
64-bit ("double") layout is similar.
Special values
Signed zero
In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0). In most
run-time environments, positive zero is usually printed as "
0
" and the negative zero as "
-0
". The two values behave as equal in numerical comparisons, but some operations return different results for +0 and −0. For instance, 1/(−0) returns negative infinity, while 1/+0 returns positive infinity (so that the identity is maintained). Other common
functions with a discontinuity at ''x''=0 which might treat +0 and −0 differently include
log
Log most often refers to:
* Trunk (botany), the stem and main wooden axis of a tree, called logs when cut
** Logging, cutting down trees for logs
** Firewood, logs used for fuel
** Lumber or timber, converted from wood logs
* Logarithm, in mathe ...
(''x''),
signum(''x''), and the
principal square root of for any negative number ''y''. As with any approximation scheme, operations involving "negative zero" can occasionally cause confusion. For example, in IEEE 754, ''x'' = ''y'' does not always imply 1/''x'' = 1/''y'', as 0 = −0 but 1/0 ≠ 1/−0.
Subnormal numbers
Subnormal values fill the
underflow gap with values where the absolute distance between them is the same as for adjacent values just outside the underflow gap. This is an improvement over the older practice to just have zero in the underflow gap, and where underflowing results were replaced by zero (flush to zero).
Modern floating-point hardware usually handles subnormal values (as well as normal values), and does not require software emulation for subnormals.
Infinities
The infinities of the
extended real number line can be represented in IEEE floating-point datatypes, just like ordinary floating-point values like 1, 1.5, etc. They are not error values in any way, though they are often (depends on the rounding) used as replacement values when there is an overflow. Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "" if the programming language allows that syntax).
IEEE 754 requires infinities to be handled in a reasonable way, such as
*
*
* NaN – there is no meaningful thing to do
NaNs
IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, , or sqrt(−1). In general, NaNs will be propagated, i.e. most operations involving a NaN will result in a NaN, although functions that would give some defined result for any given floating-point value will do so for NaNs as well, e.g. NaN ^ 0 = 1. There are two kinds of NaNs: the default ''quiet'' NaNs and, optionally, ''signaling'' NaNs. A signaling NaN in any arithmetic operation (including numerical comparisons) will cause an "invalid operation"
exception to be signaled.
The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type or source of error; but there is no standard for that encoding. In theory, signaling NaNs could be used by a
runtime system
In computer programming, a runtime system or runtime environment is a sub-system that exists both in the computer where a program is created, as well as in the computers where the program is intended to be run. The name comes from the compile t ...
to flag uninitialized variables, or extend the floating-point numbers with other special values without slowing down the computations with ordinary values, although such extensions are not common.
IEEE 754 design rationale
It is a common misconception that the more esoteric features of the IEEE 754 standard discussed here, such as extended formats, NaN, infinities, subnormals etc., are only of interest to
numerical analysts, or for advanced numerical applications. In fact the opposite is true: these features are designed to give safe robust defaults for numerically unsophisticated programmers, in addition to supporting sophisticated numerical libraries by experts. The key designer of IEEE 754,
William Kahan notes that it is incorrect to "...
eem
The Eem (; formerly the Amer) is a river in the central Netherlands with a length of approximately .
The river is fed by the Vallei Canal and a number of Veluwe creeks, the most important of which are the Heiligenberger Beek, the Barneveldse Beek ...
features of IEEE Standard 754 for Binary Floating-Point Arithmetic that ...
renot appreciated to be features usable by none but numerical experts. The facts are quite the opposite. In 1977 those features were designed into the Intel 8087 to serve the widest possible market... Error-analysis tells us how to design floating-point arithmetic, like IEEE Standard 754, moderately tolerant of well-meaning ignorance among programmers".
* The special values such as infinity and NaN ensure that the floating-point arithmetic is algebraically complete: every floating-point operation produces a well-defined result and will not—by default—throw a machine interrupt or trap. Moreover, the choices of special values returned in exceptional cases were designed to give the correct answer in many cases. For instance, under IEEE 754 arithmetic, continued fractions such as R(z) := 7 − 3/
_−_2_−_1/(z_−_7_+_10/ _−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)">_−_2_−_2/(z_−_3).html"_;"title="_−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)">_−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)will_give_the_correct_answer_on_all_inputs,_as_the_potential_divide_by_zero,_e.g._for_,_is_correctly_handled_by_giving_+infinity,_and_so_such_exceptions_can_be_safely_ignored.
_As_noted_by_Kahan,_the_unhandled_trap_consecutive_to_a_floating-point_to_16-bit_integer_conversion_overflow_that_caused_the_cluster_(spacecraft).html" ;"title="_−_2_−_2/(z_−_3).html" ;"title="_−_2_−_2/(z_−_3).html" ;"title=" − 2 − 1/(z − 7 + 10/
_−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)">_−_2_−_2/(z_−_3).html"_;"title="_−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)">_−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)will_give_the_correct_answer_on_all_inputs,_as_the_potential_divide_by_zero,_e.g._for_,_is_correctly_handled_by_giving_+infinity,_and_so_such_exceptions_can_be_safely_ignored.
_As_noted_by_Kahan,_the_unhandled_trap_consecutive_to_a_floating-point_to_16-bit_integer_conversion_overflow_that_caused_the_cluster_(spacecraft)">loss_of_an_Ariane_5_rocket_would_not_have_happened_under_the_default_IEEE_754_floating-point_policy.
*_Subnormal_numbers_ensure_that_for_''finite''_floating-point_numbers_x_and_y,_x_−_y_=_0_if_and_only_if_x_=_y,_as_expected,_but_which_did_not_hold_under_earlier_floating-point_representations.
*_On_the_design_rationale_of_the_x87_
_−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)">_−_2_−_2/(z_−_3).html"_;"title="_−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)">_−_2_−_1/(z_−_7_+_10/[z_−_2_−_2/(z_−_3)will_give_the_correct_answer_on_all_inputs,_as_the_potential_divide_by_zero,_e.g._for_,_is_correctly_handled_by_giving_+infinity,_and_so_such_exceptions_can_be_safely_ignored.
_As_noted_by_Kahan,_the_unhandled_trap_consecutive_to_a_floating-point_to_16-bit_integer_conversion_overflow_that_caused_the_cluster_(spacecraft)">loss_of_an_Ariane_5_rocket_would_not_have_happened_under_the_default_IEEE_754_floating-point_policy.
*_Subnormal_numbers_ensure_that_for_''finite''_floating-point_numbers_x_and_y,_x_−_y_=_0_if_and_only_if_x_=_y,_as_expected,_but_which_did_not_hold_under_earlier_floating-point_representations.
*_On_the_design_rationale_of_the_x87_extended_precision">80-bit_format,_Kahan_notes:_"This_Extended_format_is_designed_to_be_used,_with_negligible_loss_of_speed,_for_all_but_the_simplest_arithmetic_with_float_and_double_operands._For_example,_it_should_be_used_for_scratch_variables_in_loops_that_implement_recurrences_like_polynomial_evaluation,_scalar_products,_partial_and_continued_fractions._It_often_averts_premature_Over/Underflow_or_severe_local_cancellation_that_can_spoil_simple_algorithms".
_Computing_intermediate_results_in_an_extended_format_with_high_precision_and_extended_exponent_has_precedents_in_the_historical_practice_of_scientific_significant_figures#Arithmetic.html" ;"title="extended_precision.html" ;"title=" − 2 − 2/(z − 3)"> − 2 − 1/(z − 7 + 10/[z − 2 − 2/(z − 3)">_−_2_−_2/(z_−_3).html" ;"title=" − 2 − 1/(z − 7 + 10/[z − 2 − 2/(z − 3)"> − 2 − 1/(z − 7 + 10/[z − 2 − 2/(z − 3)will give the correct answer on all inputs, as the potential divide by zero, e.g. for , is correctly handled by giving +infinity, and so such exceptions can be safely ignored.
As noted by Kahan, the unhandled trap consecutive to a floating-point to 16-bit integer conversion overflow that caused the cluster (spacecraft)">loss of an Ariane 5 rocket would not have happened under the default IEEE 754 floating-point policy.
* Subnormal numbers ensure that for ''finite'' floating-point numbers x and y, x − y = 0 if and only if x = y, as expected, but which did not hold under earlier floating-point representations.
* On the design rationale of the x87 extended precision">80-bit format, Kahan notes: "This Extended format is designed to be used, with negligible loss of speed, for all but the simplest arithmetic with float and double operands. For example, it should be used for scratch variables in loops that implement recurrences like polynomial evaluation, scalar products, partial and continued fractions. It often averts premature Over/Underflow or severe local cancellation that can spoil simple algorithms".
Computing intermediate results in an extended format with high precision and extended exponent has precedents in the historical practice of scientific significant figures#Arithmetic">calculation and in the design of scientific calculators e.g. Hewlett-Packard's financial calculators performed arithmetic and financial functions to three more significant decimals than they stored or displayed.
The implementation of extended precision enabled standard elementary function libraries to be readily developed that normally gave double precision results within one
unit in the last place
In computer science and numerical analysis, unit in the last place or unit of least precision (ulp) is the spacing between two consecutive floating-point numbers, i.e., the value the least significant digit (rightmost digit) represents if it is 1 ...
(ULP) at high speed.
* Correct rounding of values to the nearest representable value avoids systematic biases in calculations and slows the growth of errors. Rounding ties to even removes the statistical bias that can occur in adding similar figures.
* Directed rounding was intended as an aid with checking error bounds, for instance in
interval arithmetic
Interval arithmetic (also known as interval mathematics, interval analysis, or interval computation) is a mathematical technique used to put bounds on rounding errors and measurement errors in mathematical computation. Numerical methods usin ...
. It is also used in the implementation of some functions.
* The mathematical basis of the operations, in particular correct rounding, allows one to prove mathematical properties and design floating-point algorithms such as
2Sum, Fast2Sum and
Kahan summation algorithm, e.g. to improve accuracy or implement multiple-precision arithmetic subroutines relatively easily.
A property of the single- and double-precision formats is that their encoding allows one to easily sort them without using floating-point hardware. Their bits
interpreted as a
two's-complement
Two's complement is a mathematical operation to reversibly convert a positive binary number into a negative binary number with equivalent (but negative) value, using the binary digit with the greatest place value (the leftmost bit in big- endian ...
integer already sort the positives correctly, with the negatives reversed. With an
xor to flip the sign bit for positive values and all bits for negative values, all the values become sortable as unsigned integers (with ).
It is unclear whether this property is intended.
Other notable floating-point formats
In addition to the widely used
IEEE 754 standard formats, other floating-point formats are used, or have been used, in certain domain-specific areas.
* The
Microsoft Binary Format (MBF) was developed for the Microsoft BASIC language products, including Microsoft's first ever product the
Altair BASIC
Altair BASIC is a discontinued interpreter for the BASIC programming language that ran on the MITS Altair 8800 and subsequent S-100 bus computers. It was Microsoft's first product (as Micro-Soft), distributed by MITS under a contract. Altair BASI ...
(1975),
TRS-80 LEVEL II,
CP/M's
MBASIC
MBASIC is the Microsoft BASIC implementation of BASIC for the CP/M operating system. MBASIC is a descendant of the original Altair BASIC interpreters that were among Microsoft's first products. MBASIC was one of the two versions of BASIC bundled w ...
,
IBM PC 5150
The IBM Personal Computer (model 5150, commonly known as the IBM PC) is the first microcomputer released in the IBM PC model line and the basis for the IBM PC compatible de facto standard. Released on August 12, 1981, it was created by a team ...
's
BASICA,
MS-DOS
MS-DOS ( ; acronym for Microsoft Disk Operating System, also known as Microsoft DOS) is an operating system for x86-based personal computers mostly developed by Microsoft. Collectively, MS-DOS, its rebranding as IBM PC DOS, and a few ope ...
's
GW-BASIC
GW-BASIC is a dialect of the BASIC programming language developed by Microsoft from IBM BASICA. Functionally identical to BASICA, its BASIC interpreter is a fully self-contained executable and does not need the Cassette BASIC ROM found in the ...
and
QuickBASIC
Microsoft QuickBASIC (also QB) is an Integrated Development Environment (or IDE) and compiler for the BASIC programming language that was developed by Microsoft. QuickBASIC runs mainly on DOS, though there was also a short-lived version for the c ...
prior to version 4.00. QuickBASIC version 4.00 and 4.50 switched to the IEEE 754-1985 format but can revert to the MBF format using the /MBF command option. MBF was designed and developed on a simulated
Intel 8080
The Intel 8080 (''"eighty-eighty"'') is the second 8-bit microprocessor designed and manufactured by Intel. It first appeared in April 1974 and is an extended and enhanced variant of the earlier 8008 design, although without binary compatibil ...
by
Monte Davidoff
Monte Davidoff (; born 1956) is an American computer programmer.
Davidoff is from Glendale, Wisconsin. He graduated from Nicolet High School in 1974, and went on to Harvard College, where he majored in applied mathematics, the department at Harva ...
, a dormmate of
Bill Gates
William Henry Gates III (born October 28, 1955) is an American business magnate and philanthropist. He is a co-founder of Microsoft, along with his late childhood friend Paul Allen. During his career at Microsoft, Gates held the positions ...
, during spring of 1975 for the
MITS Altair 8800. The initial release of July 1975 supported a single-precision (32 bits) format due to cost of the
MITS Altair 8800 4-kilobytes memory. In December 1975, the 8-kilobytes version added a double-precision (64 bits) format. A single-precision (40 bits) variant format was adopted for other CPU's, notably the
MOS 6502
The MOS Technology 6502 (typically pronounced "sixty-five-oh-two" or "six-five-oh-two") William Mensch and the moderator both pronounce the 6502 microprocessor as ''"sixty-five-oh-two"''. is an 8-bit microprocessor that was designed by a small te ...
(
Apple //,
Commodore PET
The Commodore PET is a line of personal computers produced starting in 1977 by Commodore International. A single all-in-one case combines a MOS Technology 6502 microprocessor, Commodore BASIC in read-only memory, keyboard, monochrome monitor, ...
,
Atari),
Motorola 6800
The 6800 ("''sixty-eight hundred''") is an 8-bit microprocessor designed and first manufactured by Motorola in 1974. The MC6800 microprocessor was part of the M6800 Microcomputer System (latter dubbed ''68xx'') that also included serial and para ...
(MITS Altair 680) and
Motorola 6809
The Motorola 6809 ("''sixty-eight-oh-nine''") is an 8-bit computing, 8-bit microprocessor with some 16-bit computing, 16-bit features. It was designed by Motorola's Terry Ritter and Joel Boney and introduced in 1978. Although source compatible wi ...
(
TRS-80 Color Computer
The RadioShack TRS-80 Color Computer, later marketed as the Tandy Color Computer and sometimes nicknamed the CoCo, is a line of home computers developed and sold by Tandy Corporation. Despite sharing a name with the earlier TRS-80, the Color Com ...
). All Microsoft language products from 1975 through 1987 used the
Microsoft Binary Format
In computing, Microsoft Binary Format (MBF) is a format for floating-point numbers which was used in Microsoft's BASIC language products, including MBASIC, GW-BASIC and QuickBASIC prior to version 4.00.
There are two main versions of the forma ...
until Microsoft adopted the IEEE-754 standard format in all its products starting in 1988 to their current releases. MBF consists of the MBF single-precision format (32 bits, "6-digit BASIC"),
the MBF extended-precision format (40 bits, "9-digit BASIC"),
and the MBF double-precision format (64 bits);
each of them is represented with an 8-bit exponent, followed by a sign bit, followed by a significand of respectively 23, 31, and 55 bits.
* The
Bfloat16 format requires the same amount of memory (16 bits) as the
IEEE 754 half-precision format, but allocates 8 bits to the exponent instead of 5, thus providing the same range as a
IEEE 754 single-precision number. The tradeoff is a reduced precision, as the trailing significand field is reduced from 10 to 7 bits. This format is mainly used in the training of
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
* The TensorFloat-32
format combines the 8 bits of exponent of the Bfloat16 with the 10 bits of trailing significand field of half-precision formats, resulting in a size of 19 bits. This format was introduced by
Nvidia
Nvidia CorporationOfficially written as NVIDIA and stylized in its logo as VIDIA with the lowercase "n" the same height as the uppercase "VIDIA"; formerly stylized as VIDIA with a large italicized lowercase "n" on products from the mid 1990s to ...
, which provides hardware support for it in the Tensor Cores of its
GPUs
A graphics processing unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobil ...
based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.
* The
Hopper
Hopper or hoppers may refer to:
Places
*Hopper, Illinois
* Hopper, West Virginia
* Hopper, a mountain and valley in the Hunza–Nagar District of Pakistan
* Hopper (crater), a crater on Mercury
People with the name
* Hopper (surname)
* Grace H ...
architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).
Representable numbers, conversion and rounding
By their nature, all numbers expressed in floating-point format are
rational number
In mathematics, a rational number is a number that can be expressed as the quotient or fraction of two integers, a numerator and a non-zero denominator . For example, is a rational number, as is every integer (e.g. ). The set of all rat ...
s with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as
π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the decimal number 123456789 cannot be exactly represented if only eight decimal digits of precision are available (it would be rounded to one of the two straddling representable values, 12345678 × 10
1 or 12345679 × 10
1), the same applies to
non-terminating digits (. to be rounded to either .55555555 or .55555556).
When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value from the original, and the value thus adjusted is called the ''rounded value''.
Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers that appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:
: ''e'' = −4; ''s'' = 1100110011001100110011001100110011...,
where, as previously, ''s'' is the significand and ''e'' is the exponent.
When rounded to 24 bits this becomes
: ''e'' = −4; ''s'' = 110011001100110011001101,
which is actually 0.100000001490116119384765625 in decimal.
As a further example, the real number
π, represented in binary as an infinite sequence of bits is
: 11.0010010000111111011010101000100010000101101000110000100011010011...
but is
: 11.0010010000111111011011
when approximated by
rounding
Rounding means replacing a number with an approximate value that has a shorter, simpler, or more explicit representation. For example, replacing $ with $, the fraction 312/937 with 1/3, or the expression with .
Rounding is often done to ob ...
to a precision of 24 bits.
In binary single-precision floating-point, this is represented as ''s'' = 1.10010010000111111011011 with ''e'' = 1.
This has a decimal value of
: 3.1415927410125732421875,
whereas a more accurate approximation of the true value of π is
: 3.14159265358979323846264338327950...
The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the
discretization error
In numerical analysis, computational physics, and simulation, discretization error is the error resulting from the fact that a function of a continuous variable is represented in the computer by a finite number of evaluations, for example, on a ...
and is limited by the
machine epsilon.
The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a
unit in the last place
In computer science and numerical analysis, unit in the last place or unit of least precision (ulp) is the spacing between two consecutive floating-point numbers, i.e., the value the least significant digit (rightmost digit) represents if it is 1 ...
(ULP). For example, if there is no representable number lying between the representable numbers 1.45a70c22
hex and 1.45a70c24
hex, the ULP is 2×16
−8, or 2
−31. For numbers with a base-2 exponent part of 0, i.e. numbers with an absolute value higher than or equal to 1 but lower than 2, an ULP is exactly 2
−23 or about 10
−7 in single precision, and exactly 2
−53 or about 10
−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of a ULP.
Rounding modes
Rounding is used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more digits than there are digits in the significand. IEEE 754 requires ''correct rounding'': that is, the rounded result is as if infinitely precise arithmetic was used to compute the value and then rounded (although in implementation only three extra bits are needed to ensure this). There are several different
rounding
Rounding means replacing a number with an approximate value that has a shorter, simpler, or more explicit representation. For example, replacing $ with $, the fraction 312/937 with 1/3, or the expression with .
Rounding is often done to ob ...
schemes (or ''rounding modes''). Historically,
truncation was the typical approach. Since the introduction of IEEE 754, the default method (''
round to nearest, ties to even'', sometimes called Banker's Rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result.
In the case of a tie, the value that would make the significand end in an even digit is chosen. The IEEE 754 standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. It means that the results of IEEE 754 operations are completely determined in all bits of the result, except for the representation of NaNs. ("Library" functions such as cosine and log are not mandated.)
Alternative rounding options are also available. IEEE 754 specifies the following rounding modes:
* round to nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)
* round to nearest, where ties round away from zero (optional for binary floating-point and commonly used in decimal)
* round up (toward +∞; negative results thus round toward zero)
* round down (toward −∞; negative results thus round away from zero)
* round toward zero (truncation; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3 and 3.9 to 3)
Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error are multi-precision floating-point, and
interval arithmetic
Interval arithmetic (also known as interval mathematics, interval analysis, or interval computation) is a mathematical technique used to put bounds on rounding errors and measurement errors in mathematical computation. Numerical methods usin ...
.
The alternative rounding modes are also useful in diagnosing numerical instability: if the results of a subroutine vary substantially between rounding to + and − infinity then it is likely numerically unstable and affected by round-off error.
Binary-to-decimal conversion with minimal number of digits
Converting a double-precision binary floating-point number to a decimal string is a common operation, but an algorithm producing results that are both accurate and minimal did not appear in print until 1990, with Steele and White's Dragon4. Some of the improvements since then include:
* David M. Gay's ''dtoa.c'', a practical open-source implementation of many ideas in Dragon4.
* Grisu3, with a 4× speedup as it removes the use of
bignums. Must be used with a fallback, as it fails for ~0.5% of cases.
* Errol3, an always-succeeding algorithm similar to, but slower than, Grisu3. Apparently not as good as an early-terminating Grisu with fallback.
* Ryū, an always-succeeding algorithm that is faster and simpler than Grisu3.
* Schubfach, an always-succeeding algorithm that is based on a similar idea to Ryū, developed almost simultaneously and independently.
Performs better than Ryū and Grisu3 in certain benchmarks.
Many modern language runtimes use Grisu3 with a Dragon4 fallback.
Decimal-to-binary conversion
The problem of parsing a decimal string into a binary FP representation is complex, with an accurate parser not appearing until Clinger's 1990 work (implemented in dtoa.c).
Further work has likewise progressed in the direction of faster parsing.
Floating-point operations
For ease of presentation and understanding, decimal
radix with 7 digit precision will be used in the examples, as in the IEEE 754 ''decimal32'' format. The fundamental principles are the same in any
radix or precision, except that normalization is optional (it does not affect the numerical value of the result). Here, ''s'' denotes the significand and ''e'' denotes the exponent.
Addition and subtraction
A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by three digits, and one then proceeds with the usual addition method:
123456.7 = 1.234567 × 10^5
101.7654 = 1.017654 × 10^2 = 0.001017654 × 10^5
Hence:
123456.7 + 101.7654 = (1.234567 × 10^5) + (1.017654 × 10^2)
= (1.234567 × 10^5) + (0.001017654 × 10^5)
= (1.234567 + 0.001017654) × 10^5
= 1.235584654 × 10^5
In detail:
e=5; s=1.234567 (123456.7)
+ e=2; s=1.017654 (101.7654)
e=5; s=1.234567
+ e=5; s=0.001017654 (after shifting)
--------------------
e=5; s=1.235584654 (true sum: 123558.4654)
This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is
e=5; s=1.235585 (final sum: 123558.5)
The lowest three digits of the second operand (654) are essentially lost. This is
round-off error. In extreme cases, the sum of two non-zero numbers may be equal to one of them:
e=5; s=1.234567
+ e=−3; s=9.876543
e=5; s=1.234567
+ e=5; s=0.00000009876543 (after shifting)
----------------------
e=5; s=1.23456709876543 (true sum)
e=5; s=1.234567 (after rounding and normalization)
In the above conceptual examples it would appear that a large number of extra digits would need to be provided by the adder to ensure correct rounding; however, for binary addition or subtraction using careful implementation techniques only a ''guard'' bit, a ''rounding'' bit and one extra ''sticky'' bit need to be carried beyond the precision of the operands.
Another problem of loss of significance occurs when ''approximations'' to two nearly equal numbers are subtracted. In the following example ''e'' = 5; ''s'' = 1.234571 and ''e'' = 5; ''s'' = 1.234567 are approximations to the rationals 123457.1467 and 123456.659.
e=5; s=1.234571
− e=5; s=1.234567
----------------
e=5; s=0.000004
e=−1; s=4.000000 (after rounding and normalization)
The floating-point difference is computed exactly because the numbers are close—the
Sterbenz lemma guarantees this, even in case of underflow when
gradual underflow is supported. Despite this, the difference of the original numbers is ''e'' = −1; ''s'' = 4.877000, which differs more than 20% from the difference ''e'' = −1; ''s'' = 4.000000 of the approximations. In extreme cases, all significant digits of precision can be lost.
This ''
cancellation'' illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in
numerical analysis
Numerical analysis is the study of algorithms that use numerical approximation (as opposed to symbolic manipulations) for the problems of mathematical analysis (as distinguished from discrete mathematics). It is the study of numerical methods ...
; see also
Accuracy problems.
Multiplication and division
To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.
e=3; s=4.734612
× e=5; s=5.417242
-----------------------
e=8; s=25.648538980104 (true product)
e=8; s=25.64854 (after rounding)
e=9; s=2.564854 (after normalization)
Similarly, division is accomplished by subtracting the divisor's exponent from the dividend's exponent, and dividing the dividend's significand by the divisor's significand.
There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed in succession.
In practice, the way these operations are carried out in digital logic can be quite complex (see
Booth's multiplication algorithm
Booth's multiplication algorithm is a multiplication algorithm that multiplies two signed binary numbers in two's complement notation. The algorithm was invented by Andrew Donald Booth in 1950 while doing research on crystallography at Birkbeck ...
and
Division algorithm
A division algorithm is an algorithm which, given two integers N and D, computes their quotient and/or remainder, the result of Euclidean division. Some are applied by hand, while others are employed by digital circuit designs and software.
Div ...
).
For a fast, simple method, see the
Horner method
In mathematics and computer science, Horner's method (or Horner's scheme) is an algorithm for polynomial evaluation. Although named after William George Horner, this method is much older, as it has been attributed to Joseph-Louis Lagrange by H ...
.
Literal syntax
Literals for floating-point numbers depend on languages. They typically use
e
or
E
to denote
scientific notation
Scientific notation is a way of expressing numbers that are too large or too small (usually would result in a long string of digits) to be conveniently written in decimal form. It may be referred to as scientific form or standard index form, o ...
. The
C programming language and the
IEEE 754 standard also define a
hexadecimal literal syntax with a base-2 exponent instead of 10. In languages like
C, when the decimal exponent is omitted, a decimal point is needed to differentiate them from integers. Other languages do not have an integer type (such as
JavaScript
JavaScript (), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of websites use JavaScript on the client side for webpage behavior, of ...
), or allow overloading of numeric types (such as
Haskell
Haskell () is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research and industrial applications, Haskell has pioneered a number of programming lan ...
). In these cases, digit strings such as
123
may also be floating-point literals.
Examples of floating-point literals are:
*
99.9
*
-5000.12
*
6.02e23
*
-3e-45
*
0x1.fffffep+127
in C and IEEE 754
Dealing with exceptional cases
Floating-point computation in a computer can run into three kinds of problems:
* An operation can be mathematically undefined, such as ∞/∞, or
division by zero
In mathematics, division by zero is division where the divisor (denominator) is zero. Such a division can be formally expressed as \tfrac, where is the dividend (numerator). In ordinary arithmetic, the expression has no meaning, as there is ...
.
* An operation can be legal in principle, but not supported by the specific format, for example, calculating the
square root
In mathematics, a square root of a number is a number such that ; in other words, a number whose ''square'' (the result of multiplying the number by itself, or ⋅ ) is . For example, 4 and −4 are square roots of 16, because .
...
of −1 or the inverse sine of 2 (both of which result in
complex number
In mathematics, a complex number is an element of a number system that extends the real numbers with a specific element denoted , called the imaginary unit and satisfying the equation i^= -1; every complex number can be expressed in the fo ...
s).
* An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large),
underflow (exponent too small) or
denormalization
Denormalization is a strategy used on a previously- normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performanc ...
(precision loss).
Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of
trap
A trap is a mechanical device used to capture or restrain an animal for purposes such as hunting, pest control, or ecological research.
Trap or TRAP may also refer to:
Art and entertainment Films and television
* ''Trap'' (2015 film), Fil ...
that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not
portable
Portable may refer to:
General
* Portable building, a manufactured structure that is built off site and moved in upon completion of site and utility work
* Portable classroom, a temporary building installed on the grounds of a school to provide ...
. (The term "exception" as used in IEEE 754 is a general term meaning an exceptional condition, which is not necessarily an error, and is a different usage to that typically defined in programming languages such as a C++ or Java, in which an "
exception" is an alternative flow of control, closer to what is termed a "trap" in IEEE 754 terminology.)
Here, the required default method of handling exceptions according to IEEE 754 is discussed (the IEEE 754 optional trapping and other "alternate exception handling" modes are not discussed). Arithmetic exceptions are (by default) required to be recorded in "sticky" status flag bits. That they are "sticky" means that they are not reset by the next (arithmetic) operation, but stay set until explicitly reset. The use of "sticky" flags thus allows for testing of exceptional conditions to be delayed until after a full floating-point expression or subroutine: without them exceptional conditions that could not be otherwise ignored would require explicit testing immediately after every floating-point operation. By default, an operation always returns a result according to specification without interrupting computation. For instance, 1/0 returns +∞, while also setting the divide-by-zero flag bit (this default of ∞ is designed to often return a finite result when used in subsequent operations and so be safely ignored).
The original IEEE 754 standard, however, failed to recommend operations to handle such sets of arithmetic exception flag bits. So while these were implemented in hardware, initially programming language implementations typically did not provide a means to access them (apart from assembler). Over time some programming language standards (e.g.,
C99
C99 (previously known as C9X) is an informal name for ISO/IEC 9899:1999, a past version of the C programming language standard. It extends the previous version ( C90) with new features for the language and the standard library, and helps impl ...
/C11 and Fortran) have been updated to specify methods to access and change status flag bits. The 2008 version of the IEEE 754 standard now specifies a few operations for accessing and handling the arithmetic flag bits. The programming model is based on a single thread of execution and use of them by multiple threads has to be handled by a
means
Means may refer to:
* Means LLC, an anti-capitalist media worker cooperative
* Means (band), a Christian hardcore band from Regina, Saskatchewan
* Means, Kentucky, a town in the US
* Means (surname)
* Means Johnston Jr. (1916–1989), US Navy adm ...
outside of the standard (e.g.
C11 C11, C.XI, C-11 or C.11 may refer to:
Transport
* C-11 Fleetster, a 1920s American light transport aircraft for use of the United States Assistant Secretary of War
* Fokker C.XI, a 1935 Dutch reconnaissance seaplane
* LET C-11, a license-build var ...
specifies that the flags have
thread-local storage
Thread-local storage (TLS) is a computer programming method that uses static or global memory local to a thread.
While the use of global variables is generally discouraged in modern programming, legacy operating systems such as UNIX are designed ...
).
IEEE 754 specifies five arithmetic exceptions that are to be recorded in the status flags ("sticky bits"):
* inexact, set if the rounded (and returned) value is different from the mathematically exact result of the operation.
* underflow, set if the rounded value is tiny (as specified in IEEE 754) ''and'' inexact (or maybe limited to if it has denormalization loss, as per the 1985 version of IEEE 754), returning a subnormal value including the zeros.
* overflow, set if the absolute value of the rounded value is too large to be represented. An infinity or maximal finite value is returned, depending on which rounding is used.
* divide-by-zero, set if the result is infinite given finite operands, returning an infinity, either +∞ or −∞.
* invalid, set if a real-valued result cannot be returned e.g. sqrt(−1) or 0/0, returning a quiet NaN.
The default return value for each of the exceptions is designed to give the correct result in the majority of cases such that the exceptions can be ignored in the majority of codes. ''inexact'' returns a correctly rounded result, and ''underflow'' returns a value less than or equal to the smallest positive normal number in magnitude and can almost always be ignored.
''divide-by-zero'' returns infinity exactly, which will typically then divide a finite number and so give zero, or else will give an ''invalid'' exception subsequently if not, and so can also typically be ignored. For example, the effective resistance of n resistors in parallel (see fig. 1) is given by
. If a short-circuit develops with
set to 0,
will return +infinity which will give a final
of 0, as expected
(see the continued fraction example of
IEEE 754 design rationale for another example).
''Overflow'' and ''invalid'' exceptions can typically not be ignored, but do not necessarily represent errors: for example, a
root-finding routine, as part of its normal operation, may evaluate a passed-in function at values outside of its domain, returning NaN and an ''invalid'' exception flag to be ignored until finding a useful start point.
Accuracy problems
The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite
precision
Precision, precise or precisely may refer to:
Science, and technology, and mathematics Mathematics and computing (general)
* Accuracy and precision, measurement deviation from true value and its scatter
* Significant figures, the number of digit ...
with which computers generally represent numbers.
For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as ; , which is
Squaring this number gives
Squaring it with single-precision floating-point hardware (with rounding) gives
But the representable number closest to 0.01 is
Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow in the usual floating-point formats (assuming an accurate implementation of tan). It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:
/* Enough digits to be sure we get the correct approximation. */
double pi = 3.1415926535897932384626433832795;
double z = tan(pi/2.0);
will give a result of 16331239353195370.0. In single precision (using the
tanf
function), the result will be −22877332.0.
By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225 in double precision, or −0.8742 in single precision.
While floating-point addition and multiplication are both
commutative
In mathematics, a binary operation is commutative if changing the order of the operands does not change the result. It is a fundamental property of many binary operations, and many mathematical proofs depend on it. Most familiar as the name of ...
( and ), they are not necessarily
associative. That is, is not necessarily equal to . Using 7-digit significand decimal arithmetic:
a = 1234.567, b = 45.67834, c = 0.0004
(a + b) + c:
1234.567 (a)
+ 45.67834 (b)
____________
1280.24534 rounds to 1280.245
1280.245 (a + b)
+ 0.0004 (c)
____________
1280.2454 rounds to 1280.245 ← (a + b) + c
a + (b + c):
45.67834 (b)
+ 0.0004 (c)
____________
45.67874
1234.567 (a)
+ 45.67874 (b + c)
____________
1280.24574 rounds to 1280.246 ← a + (b + c)
They are also not necessarily
distributive. That is, may not be the same as :
1234.567 × 3.333333 = 4115.223
1.234567 × 3.333333 = 4.115223
4115.223 + 4.115223 = 4119.338
but
1234.567 + 1.234567 = 1235.802
1235.802 × 3.333333 = 4119.340
In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:
Incidents
* On 25 February 1991, a
loss of significance
In numerical analysis, catastrophic cancellation is the phenomenon that subtracting good approximations to two nearby numbers may yield a very bad approximation to the difference of the original numbers.
For example, if there are two studs, one L_ ...
in a
MIM-104 Patriot missile battery
prevented it from intercepting an incoming
Scud missile in
Dhahran
Dhahran ( ar, الظهران, ''Al-Dhahran'') is a city located in the Eastern Province, Saudi Arabia. With a total population of 240,742 as of 2021, it is a major administrative center for the Saudi oil industry. Together with the nearby citi ...
,
Saudi Arabia
Saudi Arabia, officially the Kingdom of Saudi Arabia (KSA), is a country in Western Asia. It covers the bulk of the Arabian Peninsula, and has a land area of about , making it the fifth-largest country in Asia, the second-largest in the A ...
, contributing to the death of 28 soldiers from the U.S. Army's
14th Quartermaster Detachment.
Machine precision and backward error analysis
''Machine precision'' is a quantity that characterizes the accuracy of a floating-point system, and is used in
backward error analysis of floating-point algorithms. It is also known as unit roundoff or ''
machine epsilon''. Usually denoted , its value depends on the particular rounding being used.
With rounding to zero,
whereas rounding to nearest,
where ''B'' is the base of the system and ''P'' is the precision of the significand (in base ''B'').
This is important since it bounds the ''
relative error
The approximation error in a data value is the discrepancy between an exact value and some '' approximation'' to it. This error can be expressed as an absolute error (the numerical amount of the discrepancy) or as a relative error (the absolute e ...
'' in representing any non-zero real number within the normalized range of a floating-point system:
Backward error analysis, the theory of which was developed and popularized by
James H. Wilkinson, can be used to establish that an algorithm implementing a numerical function is numerically stable.
The basic approach is to show that although the calculated result, due to roundoff errors, will not be exactly correct, it is the exact solution to a nearby problem with slightly perturbed input data. If the perturbation required is small, on the order of the uncertainty in the input data, then the results are in some sense as accurate as the data "deserves". The algorithm is then defined as ''
backward stable''. Stability is a measure of the sensitivity to rounding errors of a given numerical procedure; by contrast, the
condition number of a function for a given problem indicates the inherent sensitivity of the function to small perturbations in its input and is independent of the implementation used to solve the problem.
As a trivial example, consider a simple expression giving the inner product of (length two) vectors
and
, then
and so
where
where
by definition, which is the sum of two slightly perturbed (on the order of Ε
mach) input data, and so is backward stable. For more realistic examples in
numerical linear algebra
Numerical linear algebra, sometimes called applied linear algebra, is the study of how matrix operations can be used to create computer algorithms which efficiently and accurately provide approximate answers to questions in continuous mathematic ...
, see Higham 2002
and other references below.
Minimizing the effect of accuracy problems
Although individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a
ULP, more complicated formulae can suffer from larger errors for a variety of reasons. The loss of accuracy can be substantial if a problem or its data are
ill-conditioned
In numerical analysis, the condition number of a function measures how much the output value of the function can change for a small change in the input argument. This is used to measure how sensitive a function is to changes or errors in the input ...
, meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm
numerically unstable for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as
numerical analysis
Numerical analysis is the study of algorithms that use numerical approximation (as opposed to symbolic manipulations) for the problems of mathematical analysis (as distinguished from discrete mathematics). It is the study of numerical methods ...
. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires,
which can remove, or reduce by orders of magnitude,
such risk:
IEEE 754 quadruple precision and
extended precision are designed for this purpose when computing at double precision.
For example, the following algorithm is a direct implementation to compute the function which is well-conditioned at 1.0,
however it can be shown to be numerically unstable and lose up to half the significant digits carried by the arithmetic when computed near 1.0.
double A(double X)
If, however, intermediate computations are all performed in extended precision (e.g. by setting line
to
C99
C99 (previously known as C9X) is an informal name for ISO/IEC 9899:1999, a past version of the C programming language standard. It extends the previous version ( C90) with new features for the language and the standard library, and helps impl ...
), then up to full precision in the final double result can be maintained.
Alternatively, a numerical analysis of the algorithm reveals that if the following non-obvious change to line
is made:
Z = log(Z) / (Z - 1.0);
then the algorithm becomes numerically stable and can compute to full double precision.
To maintain the properties of such carefully constructed numerically stable programs, careful handling by the
compiler
In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs tha ...
is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.
A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to,
and the other references at the bottom of this article. Kahan suggests several rules of thumb that can substantially decrease by orders of magnitude
the risk of numerical anomalies, in addition to, or in lieu of, a more careful numerical analysis. These include: as noted above, computing all expressions and intermediate results in the highest precision supported in hardware (a common rule of thumb is to carry twice the precision of the desired result, i.e. compute in double precision for a final single-precision result, or in double extended or quad precision for up to double-precision results
); and rounding input data and results to only the precision required and supported by the input data (carrying excess precision in the final result beyond that required and supported by the input data can be misleading, increases storage cost and decreases speed, and the excess bits can affect convergence of numerical procedures:
notably, the first form of the iterative example given below converges correctly when using this rule of thumb). Brief descriptions of several additional issues and techniques follow.
As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a
proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact.
An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation.
The "decimal" data type of the
C# and
Python
Python may refer to:
Snakes
* Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia
* Python (mythology), a mythical serpent
Computing
* Python (pro ...
programming languages, and the decimal formats of the
IEEE 754-2008
The Institute of Electrical and Electronics Engineers (IEEE) is a 501(c)(3) professional association for electronic engineering and electrical engineering (and associated disciplines) with its corporate office in New York City and its operation ...
standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.
Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that
, and that
, however these facts cannot be relied on when the quantities involved are the result of floating-point computation.
The use of the equality test (
if (xy) ...
) requires care when dealing with floating-point numbers. Even simple expressions like
0.6/0.2-30
will, on most computers, fail to be true
(in IEEE 754 double precision, for example,
0.6/0.2 - 3
is approximately equal to -4.44089209850063e-16). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (
if (abs(x-y) < epsilon) ...
, where epsilon is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon.
Values derived from the primary data representation and their comparisons should be performed in a wider, extended, precision to minimize the risk of such inconsistencies due to round-off errors.
It is often better to organize the code in such a way that such tests are unnecessary. For example, in
computational geometry, exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.
Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are
matrix inversion
In linear algebra, an -by- square matrix is called invertible (also nonsingular or nondegenerate), if there exists an -by- square matrix such that
:\mathbf = \mathbf = \mathbf_n \
where denotes the -by- identity matrix and the multiplicati ...
,
eigenvector
In linear algebra, an eigenvector () or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue, often denoted ...
computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as
iterative refinement, if they are to work well.
Summation of a vector of floating-point values is a basic algorithm in
scientific computing
Computational science, also known as scientific computing or scientific computation (SC), is a field in mathematics that uses advanced computing capabilities to understand and solve complex problems. It is an area of science that spans many disc ...
, and so an awareness of when loss of significance can occur is essential. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. A typical addition would then be something like
3253.671
+ 3.141276
-----------
3256.812
The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The
Kahan summation algorithm may be used to reduce the errors.
Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example,
Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error (
numerical analysis
Numerical analysis is the study of algorithms that use numerical approximation (as opposed to symbolic manipulations) for the problems of mathematical analysis (as distinguished from discrete mathematics). It is the study of numerical methods ...
). Two forms of the recurrence formula for the circumscribed polygon are:
*
* First form:
* second form:
*
, converging as
Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:
i 6 × 2
i × t
i, first form 6 × 2
i × t
i, second form
---------------------------------------------------------
0 .4641016151377543863 .4641016151377543863
1 .2153903091734710173 .2153903091734723496
2 596599420974940120 596599420975006733
3 60862151314012979 60862151314352708
4 27145996453136334 27145996453689225
5 8730499801259536 8730499798241950
6 6627470548084133 6627470568494473
7 6101765997805905 6101766046906629
8 70343230776862 70343215275928
9 37488171150615 37487713536668
10 9278733740748 9273850979885
11 7256228504127 7220386148377
12 717412858693 707019992125
13 189011456060 78678454728
14 717412858693 46593073709
15 19358822321783 8571730119
16 717412858693 6566394222
17 810075796233302 6065061913
18 717412858693 939728836
19 4061547378810956 908393901
20 05434924008406305 900560168
21 00068646912273617 8608396
22 349453756585929919 8122118
23 00068646912273617 95552
24 .2245152435345525443 68907
25 62246
26 62246
27 62246
28 62246
The true value is
While the two forms of the recurrence formula are clearly mathematically equivalent,
the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of
significant digit
Significant figures (also known as the significant digits, ''precision'' or ''resolution'') of a number in positional notation are digits in the number that are reliable and necessary to indicate the quantity of something.
If a number expres ...
s. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.
"Fast math" optimization
The aforementioned lack of
associativity
In mathematics, the associative property is a property of some binary operations, which means that rearranging the parentheses in an expression will not change the result. In propositional logic, associativity is a valid rule of replacement ...
of floating-point operations in general means that
compilers
In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
cannot as effectively reorder arithmetic expressions as they could with integer and fixed-point arithmetic, presenting a roadblock in optimizations such as
common subexpression elimination In compiler theory, common subexpression elimination (CSE) is a compiler optimization that searches for instances of identical expressions (i.e., they all evaluate to the same value), and analyzes whether it is worthwhile replacing them with a si ...
and auto-
vectorization
Vectorization may refer to:
Computing
* Array programming, a style of computer programming where operations are applied to whole arrays instead of individual elements
* Automatic vectorization, a compiler optimization that transforms loops to vec ...
.
The "fast math" option on many compilers (ICC, GCC, Clang, MSVC...) turns on reassociation along with unsafe assumptions such as a lack of NaN and infinite numbers in IEEE 754. Some compilers also offer more granular options to only turn on reassociation. In either case, the programmer is exposed to many of the precision pitfalls mentioned above for the portion of the program using "fast" math.
In some compilers (GCC and Clang), turning on "fast" math may cause the program to
disable subnormal floats at startup, affecting the floating-point behavior of not only the generated code, but also any program using such code as a
library
A library is a collection of materials, books or media that are accessible for use and not just for display purposes. A library provides physical (hard copies) or digital access (soft copies) materials, and may be a physical location or a vir ...
.
In most
Fortran compilers, as allowed by the ISO/IEC 1539-1:2004 Fortran standard, reassociation is the default, with breakage largely prevented by the "protect parens" setting (also on by default). This setting stops the compiler from reassociating beyond the boundaries of parentheses.
Intel Fortran Compiler
Intel Fortran Compiler, is a group of Fortran compilers from Intel for Windows, macOS, and Linux.
Overview
The compilers generate code for IA-32 and Intel 64 processors and certain non-Intel but compatible processors, such as certain AMD proces ...
is a notable outlier.
A common problem in "fast" math is that subexpressions may not be optimized identically from place to place, leading to unexpected differences. One interpretation of the issue is that "fast" math as implemented currently has a poorly defined semantics. One attempt at formalizing "fast" math optimizations is seen in ''Icing'', a verified compiler.
See also
*
Arbitrary precision
In computer science, arbitrary-precision arithmetic, also called bignum arithmetic, multiple-precision arithmetic, or sometimes infinite-precision arithmetic, indicates that calculations are performed on numbers whose digits of precision are li ...
*
C99
C99 (previously known as C9X) is an informal name for ISO/IEC 9899:1999, a past version of the C programming language standard. It extends the previous version ( C90) with new features for the language and the standard library, and helps impl ...
for code examples demonstrating access and use of IEEE 754 features.
*
Computable number
In mathematics, computable numbers are the real numbers that can be computed to within any desired precision by a finite, terminating algorithm. They are also known as the recursive numbers, effective numbers or the computable reals or recursive ...
*
Coprocessor
*
Decimal floating point
Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when convert ...
*
Double precision
Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
Flo ...
*
Experimental mathematics
Experimental mathematics is an approach to mathematics in which computation is used to investigate mathematical objects and identify properties and patterns. It has been defined as "that branch of mathematics that concerns itself ultimately with th ...
– utilizes high precision floating-point computations
*
Fixed-point arithmetic
*
Floating point error mitigation
*
FLOPS
*
Gal's accurate tables
*
GNU MPFR
*
Half precision
In computing, half precision (sometimes called FP16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications w ...
*
IEEE 754 – Standard for Binary Floating-Point Arithmetic
*
IBM Floating Point Architecture
*
Kahan summation algorithm
*
Microsoft Binary Format
In computing, Microsoft Binary Format (MBF) is a format for floating-point numbers which was used in Microsoft's BASIC language products, including MBASIC, GW-BASIC and QuickBASIC prior to version 4.00.
There are two main versions of the forma ...
(MBF)
*
Minifloat
In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iter ...
*
Q (number format) for constant resolution
*
Quadruple precision
In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision.
This 128-bit quadruple precision is desi ...
(including double-double)
*
Significant digits
*
Single precision
Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
A floatin ...
Notes
References
Further reading
* (NB. Classic influential treatises on floating-point arithmetic.)
*
*
*
* (NB. Edition with source code CD-ROM.)
*
* (1213 pages) (NB. This is a single-volume edition. This work was also available in a two-volume version.)
*
*
External links
* (NB. This page gives a very brief summary of floating-point formats that have been used over the years.)
* (NB. A compendium of non-intuitive behaviors of floating point on popular architectures, with implications for program verification and testing.)
OpenCores (NB. This website contains open source floating-point IP cores for the implementation of floating-point operators in FPGA or ASIC devices. The project ''double_fpu'' contains verilog source code of a double-precision floating-point unit. The project ''fpuvhdl'' contains vhdl source code of a single-precision floating-point unit.)
*
{{Data types
Computer arithmetic
Articles with example C code