IEEE 754-1985 was an industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008, and then again in 2019 by the minor revision IEEE 754-2019. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.
IEEE 754-1985 represents numbers in binary, providing definitions for four levels of precision, of which the two most commonly used are single precision (32 bits) and double precision (64 bits).
The standard also defines representations for positive and negative infinity, a "negative zero", five exceptions to handle invalid results like division by zero, special values called NaNs for representing those exceptions, denormal numbers to represent numbers smaller than the smallest normal numbers, and four rounding modes.
Representation of numbers
Floating-point numbers in IEEE 754 format consist of three fields: a sign bit, a biased exponent, and a fraction. The following example illustrates the meaning of each.
The decimal number 0.15625₁₀ represented in binary is 0.00101₂ (that is, 1/8 + 1/32). (Subscripts indicate the number base.) Analogous to scientific notation, where numbers are written to have a single non-zero digit to the left of the decimal point, we rewrite this number so it has a single 1 bit to the left of the "binary point". We simply multiply by the appropriate power of 2 to compensate for shifting the bits left by three positions:
: 0.00101₂ = 1.01₂ × 2^−3
Now we can read off the fraction and the exponent: the fraction is .01₂ and the exponent is −3.
The three fields in the IEEE 754 representation of this number are:
: ''sign'' = 0, because the number is positive. (1 indicates negative.)
: ''biased exponent'' = −3 + the "bias". In single precision, the bias is 127, so in this example the biased exponent is 124; in double precision, the bias is 1023, so the biased exponent in this example is 1020.
: ''fraction'' = .01000…₂.
IEEE 754 adds a bias to the exponent so that numbers can in many cases be compared conveniently by the same hardware that compares signed 2's-complement integers. Using a biased exponent, the lesser of two positive floating-point numbers will come out "less than" the greater, following the same ordering as for sign-and-magnitude integers. If two floating-point numbers have different signs, the sign-and-magnitude comparison also works with biased exponents. However, if both biased-exponent floating-point numbers are negative, then the ordering must be reversed. If the exponent were represented as, say, a 2's-complement number, comparison to see which of two numbers is greater would not be as convenient.
The leading 1 bit is omitted, since all numbers except zero start with a leading 1; the leading 1 is implicit and doesn't actually need to be stored, which gives an extra bit of precision for "free".
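As a brief sketch (not part of the standard's text), the following C program extracts the three fields from the single-precision encoding of 0.15625 described above; it assumes that the platform's float uses the IEEE 754 single-precision format, with 1 sign bit, 8 exponent bits, a bias of 127 and 23 stored fraction bits.

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>

 int main(void) {
     float x = 0.15625f;                 /* 1.01 (binary) x 2^-3 */
     uint32_t bits;
     memcpy(&bits, &x, sizeof bits);     /* reinterpret the 32-bit pattern */

     uint32_t sign     = bits >> 31;           /* 1 bit   */
     uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127 */
     uint32_t fraction = bits & 0x7FFFFF;      /* 23 bits, implicit leading 1 not stored */

     printf("bits = 0x%08X\n", bits);                     /* 0x3E200000 */
     printf("sign = %u\n", sign);                         /* 0 */
     printf("biased exponent = %u (unbiased %d)\n",
            exponent, (int)exponent - 127);               /* 124, i.e. -3 */
     printf("fraction = 0x%06X\n", fraction);             /* 0x200000 = .01000... */
     return 0;
 }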
Zero
The number zero is represented specially:
: ''sign'' = 0 for positive zero, 1 for negative zero.
: ''biased exponent'' = 0.
: ''fraction'' = 0.
Denormalized numbers
The number representations described above are called ''normalized,'' meaning that the implicit leading binary digit is a 1. To reduce the loss of precision when an underflow occurs, IEEE 754 includes the ability to represent fractions smaller than are possible in the normalized representation, by making the implicit leading digit a 0. Such numbers are called denormal. They don't include as many significant digits as a normalized number, but they enable a gradual loss of precision when the result of an operation is not exactly zero but is too close to zero to be represented by a normalized number.
A denormal number is represented with a biased exponent of all 0 bits, which represents an exponent of −126 in single precision (not −127), or −1022 in double precision (not −1023). In contrast, the smallest biased exponent representing a normal number is 1 (see examples below).
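A minimal sketch of gradual underflow in C, assuming float is IEEE 754 single precision: repeatedly halving the smallest normalized number (FLT_MIN, which is 2^−126) produces denormal values with progressively fewer significant bits, rather than jumping straight to zero.

 #include <stdio.h>
 #include <float.h>

 int main(void) {
     float x = FLT_MIN;                /* smallest normalized single: 2^-126 */
     for (int i = 0; i < 5; i++) {
         x /= 2.0f;                    /* results below FLT_MIN are denormal */
         printf("%g (still nonzero: %d)\n", x, x > 0.0f);
     }
     /* the smallest positive denormal is 2^-149, about 1.4e-45;
        halving that would finally round to zero */
     return 0;
 }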
Representation of non-numbers
The biased-exponent field is filled with all 1 bits to indicate either infinity or an invalid result of a computation.
Positive and negative infinity
Positive and negative infinity are represented thus:
: ''sign'' = 0 for positive infinity, 1 for negative infinity.
: ''biased exponent'' = all 1 bits.
: ''fraction'' = all 0 bits.
NaN
Some operations of floating-point arithmetic are invalid, such as taking the square root of a negative number. The act of reaching an invalid result is called a floating-point ''exception.'' An exceptional result is represented by a special code called a NaN, for "Not a Number". All NaNs in IEEE 754-1985 have this format:
: ''sign'' = either 0 or 1.
: ''biased exponent'' = all 1 bits.
: ''fraction'' = anything except all 0 bits (since all 0 bits represents infinity).
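As a sketch (again assuming IEEE 754 single-precision float), the following C program produces infinities and a NaN and prints their bit patterns: the exponent field is all 1s (0xFF) in every case, with a zero fraction for the infinities and a non-zero fraction for the NaN; it also shows that a NaN compares unequal even to itself.

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>
 #include <math.h>

 static uint32_t bits_of(float x) {
     uint32_t b;
     memcpy(&b, &x, sizeof b);
     return b;
 }

 int main(void) {
     float pinf = INFINITY;
     float ninf = -INFINITY;
     float qnan = sqrtf(-1.0f);        /* invalid operation, yields a NaN */

     printf("+inf = 0x%08X\n", bits_of(pinf));   /* 0x7F800000 */
     printf("-inf = 0x%08X\n", bits_of(ninf));   /* 0xFF800000 */
     printf("NaN  = 0x%08X\n", bits_of(qnan));   /* exponent 0xFF, fraction != 0 */
     printf("NaN == NaN: %d\n", qnan == qnan);   /* 0: a NaN is not equal to anything */
     return 0;
 }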
Range and precision
Precision is defined as the minimum difference between two successive mantissa representations; thus it is a function only of the mantissa, while the gap is defined as the difference between two successive numbers.
Single precision
Single-precision numbers occupy 32 bits. In single precision:
* The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the exponent field and the binary value 1 in the fraction field) are
*: ±2^−23 × 2^−126 ≈ ±1.40130 × 10^−45
* The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the exponent field and 0 in the fraction field) are
*: ±1 × 2^−126 ≈ ±1.17549 × 10^−38
* The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the exponent field and all 1s in the fraction field) are
*: ±(2 − 2^−23) × 2^127 ≈ ±3.40282 × 10^38
Some example range and gap values for given exponents in single precision:
As an example, 16,777,217 cannot be encoded as a 32-bit float; it is rounded to 16,777,216. This is one reason why binary floating-point arithmetic is generally unsuitable for accounting software, where exact decimal values are expected. However, all integers within the representable range that are a power of 2 can be stored in a 32-bit float without rounding.
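This is easy to check in C (assuming float is IEEE 754 single precision): 16,777,217 = 2^24 + 1 needs 25 significant bits, one more than the 24 available (23 stored plus the implicit leading 1), so it is rounded to the nearest representable value with an even last bit, 16,777,216.

 #include <stdio.h>

 int main(void) {
     float f = 16777217.0f;            /* 2^24 + 1: needs 25 significant bits */
     printf("%.1f\n", f);              /* prints 16777216.0 */
     printf("%d\n", 16777217.0f == 16777216.0f);   /* 1: both convert to the same float */
     return 0;
 }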
Double precision
Double-precision numbers occupy 64 bits. In double precision:
* The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
*: ±2^−52 × 2^−1022 ≈ ±4.94066 × 10^−324
* The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are
*: ±1 × 2^−1022 ≈ ±2.22507 × 10^−308
* The finite positive and finite negative numbers furthest from zero (represented by the value with 2046 in the Exp field and all 1s in the fraction field) are
*: ±(2 − 2^−52) × 2^1023 ≈ ±1.79769 × 10^308
Some example range and gap values for given exponents in double precision:
Extended formats
The standard also recommends extended format(s) to be used to perform internal computations at a higher precision than that required for the final result, to minimise round-off errors: the standard only specifies minimum precision and exponent requirements for such formats. The
x87 80-bit extended format is the most commonly implemented extended format that meets these requirements.
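How an extended format shows up in practice depends on the platform and compiler; as an illustrative sketch only, on many x86 C implementations long double maps to the x87 80-bit extended format, with a 64-bit significand instead of the 53 bits of double:

 #include <stdio.h>
 #include <float.h>

 int main(void) {
     /* 53 for IEEE 754 double precision */
     printf("DBL_MANT_DIG  = %d\n", DBL_MANT_DIG);
     /* often 64 when long double is the x87 80-bit extended format
        (implementation-defined; may also be 53 or 113 elsewhere) */
     printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);
     printf("sizeof(long double) = %zu bytes\n", sizeof(long double));
     return 0;
 }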
Examples
Here are some examples of single-precision IEEE 754 representations:
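The original table of example encodings is not reproduced here; as a rough substitute (assuming float is IEEE 754 single precision), this C program prints the bit patterns of a few notable values:

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>
 #include <math.h>
 #include <float.h>

 static void show(const char *name, float x) {
     uint32_t b;
     memcpy(&b, &x, sizeof b);
     printf("%-10s = 0x%08X\n", name, b);
 }

 int main(void) {
     show("+0",       0.0f);          /* 0x00000000 */
     show("-0",      -0.0f);          /* 0x80000000: only the sign bit set */
     show("1",        1.0f);          /* 0x3F800000: biased exponent 127, fraction 0 */
     show("0.15625",  0.15625f);      /* 0x3E200000: the worked example above */
     show("FLT_MIN",  FLT_MIN);       /* 0x00800000: smallest normalized number */
     show("denormal", FLT_MIN / 2);   /* 0x00400000: exponent field all 0s */
     show("FLT_MAX",  FLT_MAX);       /* 0x7F7FFFFF: largest finite number */
     show("+inf",     INFINITY);      /* 0x7F800000 */
     return 0;
 }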
Comparing floating-point numbers
Every possible bit combination is either a NaN or a number with a unique value in the affinely extended real number system with its associated order, except for the two combinations of bits for negative zero and positive zero, which sometimes require special attention (see below). The binary representation has the special property that, excluding NaNs, any two numbers can be compared as sign-and-magnitude integers (endianness issues apply). When comparing as 2's-complement integers: If the sign bits differ, the negative number precedes the positive number, so 2's complement gives the correct result (except that negative zero and positive zero should be considered equal). If both values are positive, the 2's complement comparison again gives the correct result. Otherwise (two negative numbers), the correct FP ordering is the opposite of the 2's complement ordering.
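A sketch of this comparison trick in C, assuming a 32-bit IEEE 754 float and ignoring NaNs and the two zeros: map each bit pattern to an unsigned key that is monotonic in the floating-point value, then compare the keys as ordinary integers. (The key transformation is a common idiom, not something mandated by the standard.)

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>

 /* Flip all bits of negative patterns, and only the sign bit of
    non-negative ones, so unsigned comparison matches float ordering. */
 static uint32_t float_key(float x) {
     uint32_t b;
     memcpy(&b, &x, sizeof b);
     return (b & 0x80000000u) ? ~b : (b | 0x80000000u);
 }

 int main(void) {
     float x = -2.5f, y = -1.0f, z = 0.5f;
     printf("%d\n", float_key(x) < float_key(y));  /* 1: -2.5 < -1.0 */
     printf("%d\n", float_key(y) < float_key(z));  /* 1: -1.0 <  0.5 */
     return 0;
 }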
Rounding errors inherent to floating point calculations may limit the use of comparisons for checking the exact equality of results. Choosing an acceptable range is a complex topic. A common technique is to use a comparison epsilon value to perform approximate comparisons. Depending on how lenient the comparisons are, common values include 1e-6 or 1e-5 for single-precision, and 1e-14 for double-precision. Another common technique is ULP (units in the last place), which checks what the difference is in the last place digits, effectively checking how many steps away the two values are.
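A minimal sketch of an absolute-epsilon comparison in C; the tolerance of 1e-6 is just one of the commonly quoted single-precision values mentioned above, and a more careful comparison would normally scale the tolerance with the magnitude of the operands:

 #include <stdio.h>
 #include <math.h>

 /* Approximate equality with a fixed absolute tolerance. */
 static int nearly_equal(float a, float b, float eps) {
     return fabsf(a - b) <= eps;
 }

 int main(void) {
     float sum = 0.0f;
     for (int i = 0; i < 10; i++)
         sum += 0.1f;                    /* 0.1 has no exact binary representation */

     printf("%d\n", sum == 1.0f);                     /* typically 0 */
     printf("%d\n", nearly_equal(sum, 1.0f, 1e-6f));  /* 1 */
     return 0;
 }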
Although negative zero and positive zero are generally considered equal for comparison purposes, some programming language relational operators and similar constructs treat them as distinct. According to the Java Language Specification, comparison and equality operators treat them as equal, but Math.min() and Math.max() distinguish them (officially starting with Java version 1.1 but actually with 1.1.1), as do the comparison methods equals(), compareTo() and even compare() of classes Float and Double.
Rounding floating-point numbers
The IEEE standard has four different rounding modes; the first is the default; the others are called ''directed roundings''.
* Round to Nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which means it is rounded up 50% of the time (in IEEE 754-2008 this mode is called ''roundTiesToEven'' to distinguish it from another round-to-nearest mode)
* Round toward 0 – directed rounding towards zero
* Round toward +∞ – directed rounding towards positive infinity
* Round toward −∞ – directed rounding towards negative infinity.
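These modes can be selected from C99 via <fenv.h>, which names them FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD and FE_DOWNWARD; rounding-mode support is implementation-defined, so the following is an illustrative sketch rather than guaranteed behaviour:

 #include <stdio.h>
 #include <fenv.h>

 #pragma STDC FENV_ACCESS ON

 int main(void) {
     volatile float ten = 10.0f, three = 3.0f;  /* volatile: discourage constant folding */
     float up, down;

     fesetround(FE_UPWARD);      /* directed rounding towards +infinity */
     up = ten / three;
     fesetround(FE_DOWNWARD);    /* directed rounding towards -infinity */
     down = ten / three;
     fesetround(FE_TONEAREST);   /* restore the default: round to nearest, ties to even */

     printf("up   = %.9f\n", up);    /* one unit in the last place larger ... */
     printf("down = %.9f\n", down);  /* ... than this value */
     return 0;
 }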
Extending the real numbers
The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. However, in the interest of reducing the complexity of the final standard, the projective mode was dropped. The Intel 8087 and Intel 80287 floating point co-processors both support this projective mode.
Functions and predicates
Standard operations
The following functions must be provided:
* Add, subtract, multiply, divide
* Square root
* Floating point remainder. This is not like a normal modulo operation; it can be negative for two positive numbers. It returns the exact value of x − (round(x/y)·y) (see the sketch after this list).
* Round to nearest integer. For undirected rounding when halfway between two integers the even integer is chosen.
* Comparison operations. Besides the more obvious results, IEEE 754 defines that −∞ = −∞, +∞ = +∞ and x ≠ NaN for any x (including NaN).
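A sketch in C contrasting this remainder with a truncating modulo; remainder() and fmod() are the C99 library names used here, and the quotient in x − (round(x/y)·y) is rounded to the nearest integer, which is why the result can be negative for two positive arguments:

 #include <stdio.h>
 #include <math.h>

 int main(void) {
     /* round(5/3) = 2, so the IEEE-style remainder is 5 - 2*3 = -1 */
     printf("remainder(5, 3) = %.1f\n", remainder(5.0, 3.0));   /* -1.0 */
     /* fmod truncates the quotient instead: 5 - 1*3 = 2 */
     printf("fmod(5, 3)      = %.1f\n", fmod(5.0, 3.0));        /*  2.0 */
     return 0;
 }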
Recommended functions and predicates
* copysign(x,y) returns x with the sign of y, so abs(x) equals copysign(x,1.0). This is one of the few operations which operates on a NaN in a way resembling arithmetic. The function copysign is new in the C99 standard (see the sketch after this list).
* −x returns x with the sign reversed. This is different from 0−x in some cases, notably when x is 0. So −(0) is −0, but the sign of 0−0 depends on the rounding mode.
* scalb(y, N)
* logb(x)
* finite(x), a predicate for "x is a finite value", equivalent to −Inf < x < Inf
* isnan(x), a predicate for "x is a NaN", equivalent to "x ≠ x"
* x <> y, which turns out to have different behavior than NOT(x = y) due to NaN.
* unordered(x, y), true when "x is unordered with y", i.e., either x or y is a NaN.
* class(x)
* nextafter(x,y) returns the next representable value from x in the direction towards y
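A C99-flavoured sketch of a few of these; copysign, nextafter and isnan exist under these names in <math.h>, and signbit() is used here to make the sign of a zero visible (a negative zero otherwise prints like an ordinary zero):

 #include <stdio.h>
 #include <math.h>

 int main(void) {
     printf("copysign(3.0, -1.0) = %g\n", copysign(3.0, -1.0));   /* -3 */

     double z = 0.0;
     printf("sign of -z    : %d\n", signbit(-z) != 0);     /* 1: -(+0) is -0 */
     printf("sign of z - z : %d\n", signbit(z - z) != 0);  /* 0 under round-to-nearest */

     double qnan = z / z;                                  /* invalid operation: NaN */
     printf("isnan = %d, x != x = %d\n", isnan(qnan), qnan != qnan);   /* 1, 1 */

     /* the next representable double above 1.0 differs by 2^-52 */
     printf("nextafter(1.0, 2.0) - 1.0 = %g\n", nextafter(1.0, 2.0) - 1.0);
     return 0;
 }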
History
In 1976, Intel was starting the development of a floating-point coprocessor. Intel hoped to be able to sell a chip containing good implementations of all the operations found in the widely varying maths software libraries.
John Palmer, who managed the project, believed the effort should be backed by a standard unifying floating point operations across disparate processors. He contacted William Kahan of the University of California, who had helped improve the accuracy of Hewlett-Packard's calculators. Kahan suggested that Intel use the floating point of Digital Equipment Corporation's (DEC) VAX. The first VAX, the VAX-11/780, had just come out in late 1977, and its floating point was highly regarded. However, seeking to market their chip to the broadest possible market, Intel wanted the best floating point possible, and Kahan went on to draw up specifications.
Kahan initially recommended that the floating point base be decimal but the hardware design of the coprocessor was too far along to make that change.
The work within Intel worried other vendors, who set up a standardization effort to ensure a "level playing field". Kahan attended the second IEEE 754 standards working group meeting, held in November 1977. He subsequently received permission from Intel to put forward a draft proposal based on his work for their coprocessor; he was allowed to explain details of the format and its rationale, but not anything related to Intel's implementation architecture. The draft was co-written with Jerome Coonen and
Harold Stone, and was initially known as the "Kahan-Coonen-Stone proposal" or "K-C-S format".
As an 8-bit exponent was not wide enough for some operations desired for double-precision numbers, e.g. to store the product of two 32-bit numbers, both Kahan's proposal and a counter-proposal by DEC used 11 bits, like the time-tested 60-bit floating-point format of the CDC 6600 from 1965.
Kahan's proposal also provided for infinities, which are useful when dealing with division-by-zero conditions; not-a-number values, which are useful when dealing with invalid operations; denormal numbers, which help mitigate problems caused by underflow; and a better balanced exponent bias, which can help avoid overflow and underflow when taking the reciprocal of a number.
Even before it was approved, the draft standard had been implemented by a number of manufacturers. The Intel 8087, announced in 1980, was the first chip to implement the draft standard, but DEC remained opposed, to denormal numbers in particular, because of performance concerns and because standardising on DEC's own format would have given DEC a competitive advantage.
The arguments over gradual underflow lasted until 1981, when an expert hired by DEC to assess it sided against the dissenters. DEC had the study done in order to demonstrate that gradual underflow was a bad idea, but the study concluded the opposite, and DEC gave in. In 1985, the standard was ratified, but it had already become the de facto standard a year earlier, implemented by many manufacturers.
See also
* IEEE 754
* Minifloat, for simple examples of properties of IEEE 754 floating point numbers
* Fixed-point arithmetic