A roundoff error, also called rounding error, is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic. Rounding errors are due to inexactness in the representation of real numbers and the arithmetic operations done with them. This is a form of quantization error. When using approximating equations or algorithms, especially when using finitely many digits to represent real numbers (which in theory have infinitely many digits), one of the goals of numerical analysis is to estimate computation errors. Computation errors, also called numerical errors, include both truncation errors and roundoff errors.

When a sequence of calculations with an input involving any roundoff error is made, errors may accumulate, sometimes dominating the calculation. In ill-conditioned problems, significant error may accumulate.

In short, there are two major facets of roundoff errors involved in numerical calculations:
# The ability of computers to represent both magnitude and precision of numbers is inherently limited.
# Certain numerical manipulations are highly sensitive to roundoff errors. This can result both from mathematical considerations and from the way in which computers perform arithmetic operations.


Representation error

The error introduced by attempting to represent a number using a finite string of digits is a form of roundoff error called representation error. Here are some examples of representation error in decimal representations:
* The rational number 1/3 = 0.3333\ldots, truncated to four significant digits as 0.3333, carries a representation error of 0.0000333\ldots
* The irrational number \pi = 3.14159265\ldots, represented as 3.1416, carries a representation error of about 0.0000073.

Increasing the number of digits allowed in a representation reduces the magnitude of possible roundoff errors, but any representation limited to finitely many digits will still cause some degree of roundoff error for uncountably many real numbers. Additional digits used for intermediary steps of a calculation are known as guard digits.

Rounding multiple times can cause error to accumulate. For example, if 9.945309 is rounded to two decimal places (9.95), then rounded again to one decimal place (10.0), the total error is 0.054691. Rounding 9.945309 to one decimal place (9.9) in a single step introduces less error (0.045309). This can occur, for example, when software performs arithmetic in x86 80-bit floating-point and then rounds the result to IEEE 754 binary64 floating-point.
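The double-rounding effect above can be reproduced with Python's standard decimal module; a minimal sketch, using round-half-up to match the decimal example (IEEE binary arithmetic itself defaults to round-half-to-even):

```python
from decimal import Decimal, ROUND_HALF_UP

x = Decimal("9.945309")

# round twice: first to two decimal places, then to one
twice = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)    # 9.95
twice = twice.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)  # 10.0

# round once, directly to one decimal place
once = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)       # 9.9

print(twice - x, once - x)   # 0.054691 versus -0.045309
```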


Floating-point number system

Compared with the fixed-point number system, the floating-point number system is more efficient in representing real numbers, so it is widely used in modern computers. While the real numbers \mathbb{R} are infinite and continuous, a floating-point number system F is finite and discrete. Thus, representation error, which leads to roundoff error, occurs under the floating-point number system.
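The effect is easy to observe in IEEE binary64 arithmetic; a minimal Python sketch (the decimal module is used here only to display the exact stored value):

```python
from decimal import Decimal

# 0.1, 0.2 and 0.3 all carry representation error in binary64,
# so the computed sum does not equal the stored representation of 0.3.
print(0.1 + 0.2 == 0.3)   # False
print(0.1 + 0.2)          # 0.30000000000000004
print(Decimal(0.1))       # the exact value actually stored for 0.1
```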


Notation of floating-point number system

A floating-point number system F is characterized by four integers:
* \beta: base or radix
* p: precision
* [L, U]: exponent range, where L is the lower bound and U is the upper bound

Any x \in F has the following form:
x = \pm (\underbrace{d_0.d_1 d_2 \ldots d_{p-1}}_{\text{mantissa}})_{\beta} \times \beta^{E} = \pm \left( d_0 \times \beta^{E}+d_1 \times \beta^{E-1}+\ldots+ d_{p-1} \times \beta^{E-(p-1)} \right)
where d_i is an integer such that 0 \leq d_i \leq \beta-1 for i = 0, 1, \ldots, p-1, and E is an integer such that L \leq E \leq U.
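For illustration, the defining formula can be evaluated directly; in the Python sketch below, the function fp_value and its arguments are illustrative names, not part of any library.

```python
# Value of ±(d_0.d_1...d_{p-1})_β × β^E from its sign, digit list, base and exponent.
def fp_value(sign, digits, base, exponent):
    mantissa = sum(d * base ** (-i) for i, d in enumerate(digits))
    return sign * mantissa * base ** exponent

# (1.01)_2 × 2^3 = 1.25 × 8 = 10.0
print(fp_value(+1, [1, 0, 1], 2, 3))   # 10.0
```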


Normalized floating-point number system

* A floating-point number system is normalized if the leading digit d_0 is always nonzero unless the number is zero. Since the mantissa is d_0.d_1d_2\ldots d_{p-1}, the mantissa of a nonzero number in a normalized system satisfies 1 \leq \text{mantissa} < \beta. Thus, the normalized form of a nonzero IEEE floating-point number is \pm 1.bb \ldots b \times 2^{E}, where b \in \{0,1\}. In binary, the leading digit is always 1, so it is not written out and is called the implicit bit. This gives an extra bit of precision, so the roundoff error caused by representation error is reduced.
* Since a floating-point number system F is finite and discrete, it cannot represent all real numbers; infinitely many real numbers can only be approximated by finitely many floating-point numbers through rounding rules. The floating-point approximation of a given real number x is denoted fl(x).
** The total number of normalized floating-point numbers is 2(\beta -1)\beta^{p-1} (U-L+1)+1, where
*** 2 counts the choice of sign, positive or negative
*** (\beta -1) counts the choice of the leading digit
*** \beta^{p-1} counts the remaining mantissa digits
*** U-L+1 counts the choice of exponents
*** 1 counts the case when the number is 0.
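As a rough check of this counting formula, a small normalized system can be enumerated directly. The Python sketch below is illustrative only; the helper normalized_floats and the toy parameters \beta=2, p=3, L=-1, U=1 are assumptions chosen for the example.

```python
from itertools import product

def normalized_floats(base, precision, L, U):
    values = {0.0}   # zero is the extra, unnormalized value included in the count
    for sign in (+1, -1):
        for d0 in range(1, base):                           # leading digit nonzero
            for rest in product(range(base), repeat=precision - 1):
                for E in range(L, U + 1):
                    mantissa = d0 + sum(d * base ** -(i + 1) for i, d in enumerate(rest))
                    values.add(sign * mantissa * base ** E)
    return values

beta, p, L, U = 2, 3, -1, 1
count = len(normalized_floats(beta, p, L, U))
print(count, 2 * (beta - 1) * beta ** (p - 1) * (U - L + 1) + 1)   # 25 25
```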


IEEE standard

In the IEEE standard the base is binary, i.e. \beta = 2, and normalization is used. The IEEE standard stores the sign, exponent, and mantissa in separate fields of a floating-point word, each of which has a fixed width (number of bits). The two most commonly used levels of precision for floating-point numbers are single precision and double precision. In single precision, the sign, exponent, and mantissa fields are 1, 8, and 23 bits wide (32 bits in total); in double precision, they are 1, 11, and 52 bits wide (64 bits in total).
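For illustration, the three fields of a binary64 value can be read off by reinterpreting its bytes; a minimal Python sketch (the helper name fields is illustrative):

```python
import struct

def fields(x):
    # reinterpret the 8 bytes of a binary64 value as an integer and split the bit fields
    bits = int.from_bytes(struct.pack(">d", x), "big")
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 11-bit biased exponent field
    fraction = bits & ((1 << 52) - 1)        # 52-bit mantissa (fraction) field
    return sign, exponent, fraction

print(fields(9.4))   # exponent field is 1026, i.e. bias 1023 plus unbiased exponent 3
```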


Machine epsilon

Machine epsilon can be used to measure the level of roundoff error in the floating-point number system. Here are two different definitions.
* The machine epsilon, denoted \epsilon_\text{mach}, is the maximum possible absolute relative error in representing a nonzero real number x in a floating-point number system: \epsilon_\text{mach} = \max_{x} \frac{|x-fl(x)|}{|x|}.
* The machine epsilon, denoted \epsilon_\text{mach}, is the smallest number \epsilon such that fl(1+\epsilon) > 1. Thus, fl(1+\delta)=fl(1)=1 whenever |\delta| < \epsilon_\text{mach}.
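A common way to probe the second definition in IEEE double precision is the classic halving loop; the Python sketch below finds the smallest power of two \epsilon with fl(1+\epsilon) > 1, which matches the value reported by the platform.

```python
import sys

eps = 1.0
while 1.0 + eps / 2 > 1.0:   # stop once adding eps/2 no longer changes 1.0
    eps /= 2

print(eps)                     # 2.220446049250313e-16, i.e. 2**-52
print(sys.float_info.epsilon)  # the same value
```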


Roundoff error under different rounding rules

There are two common rounding rules, round-by-chop and round-to-nearest. The IEEE standard uses round-to-nearest.
* Round-by-chop: The base-\beta expansion of x is truncated after the (p-1)-th digit.
** This rounding rule is biased because it always moves the result toward zero.
* Round-to-nearest: fl(x) is set to the nearest floating-point number to x. When there is a tie, the floating-point number whose last stored digit is even is used.
** For the IEEE standard, where the base \beta is 2, this means that when there is a tie the number is rounded so that the last digit is equal to 0.
** This rounding rule is more accurate but more computationally expensive.
** Rounding so that the last stored digit is even when there is a tie ensures that results are not rounded up or down systematically, which helps avoid an unwanted slow drift in long calculations due simply to biased rounding.
* The following example illustrates the level of roundoff error under the two rounding rules: in a decimal system with a two-digit mantissa, round-by-chop maps 1.67 to 1.6 (roundoff error 0.07), while round-to-nearest maps it to 1.7 (roundoff error 0.03). Round-to-nearest leads to less roundoff error in general, as the sketch below also shows.
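The comparison can be reproduced with two small helpers that keep p significant decimal digits; the Python sketch below is illustrative (round_by_chop and round_to_nearest are not library functions, and Python's built-in round already breaks ties toward an even last digit):

```python
import math

def round_by_chop(x, p):
    e = math.floor(math.log10(abs(x)))    # decimal exponent of the leading digit
    scale = 10 ** (p - 1 - e)
    return math.trunc(x * scale) / scale   # truncate toward zero

def round_to_nearest(x, p):
    e = math.floor(math.log10(abs(x)))
    scale = 10 ** (p - 1 - e)
    return round(x * scale) / scale        # ties go to the even last digit

x = 1.67
print(round_by_chop(x, 2), round_to_nearest(x, 2))                    # 1.6 and 1.7
print(abs(x - round_by_chop(x, 2)), abs(x - round_to_nearest(x, 2)))  # ~0.07 versus ~0.03
```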


Calculating roundoff error in IEEE standard

Suppose the use of round-to-nearest and IEEE double precision.
* Example: the decimal number (9.4)_{10}=(1001.\overline{0110})_2 can be rearranged into
+1.\underbrace{0010110011001100110011001100110011001100110011001100}_{\text{52 bits}}110\ldots \times 2^{3}
Since the 53rd bit to the right of the binary point is a 1 and is followed by other nonzero bits, the round-to-nearest rule requires rounding up, that is, adding 1 to the 52nd bit. Thus, the normalized floating-point representation in IEEE standard of 9.4 is
fl(9.4)=1.0010110011001100110011001100110011001100110011001101 \times 2^{3}.
* Now the roundoff error can be calculated when representing 9.4 with fl(9.4). This representation is derived by discarding the infinite tail
0.\overline{1100} \times 2^{-52}\times 2^{3} = 0.\overline{0110} \times 2^{-51} \times 2^{3}=0.4 \times 2^{-48}
from the right tail and then adding 1 \times 2^{-52} \times 2^{3}=2^{-49} in the rounding step.
:Then fl(9.4) = 9.4-0.4 \times 2^{-48} + 2^{-49} = 9.4+(0.2)_{10} \times 2^{-49}.
:Thus, the roundoff error is (0.2 \times 2^{-49})_{10}.
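This error can be confirmed exactly with rational arithmetic; a minimal Python sketch using the standard fractions module:

```python
from fractions import Fraction

exact = Fraction(47, 5)      # 9.4 as an exact rational number
stored = Fraction(9.4)       # the exact value of the binary64 representation of 9.4
error = stored - exact

print(error == Fraction(1, 5 * 2**49))   # True: the error is 0.2 × 2^-49
print(float(error))                      # ≈ 3.55e-16
```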


Measuring roundoff error by using machine epsilon

The machine epsilon \epsilon_\text{mach} can be used to measure the level of roundoff error when using the two rounding rules above. Below are the formulas and corresponding proof. The first definition of machine epsilon is used here.


Theorem

# Round-by-chop: \epsilon_\text{mach} = \beta^{1-p}
# Round-to-nearest: \epsilon_\text{mach} = \frac{1}{2}\beta^{1-p}


Proof

Let x=d_0.d_1 d_2 \ldots d_{p-1}d_p \ldots \times \beta^{n} \in \mathbb{R}, where n \in [L, U], and let fl(x) be the floating-point representation of x. Since round-by-chop is being used,
\begin{align}
\frac{|x-fl(x)|}{|x|} &= \frac{|d_0.d_1 d_2 \ldots d_{p-1}d_p d_{p+1}\ldots \times \beta^{n} - d_0.d_1 d_2 \ldots d_{p-1} \times \beta^{n}|}{|d_0.d_1 d_2 \ldots \times \beta^{n}|}\\
&= \frac{|d_p.d_{p+1}d_{p+2} \ldots \times \beta^{n-p}|}{|d_0.d_1 d_2 \ldots \times \beta^{n}|}\\
&= \frac{|d_p.d_{p+1}d_{p+2} \ldots|}{|d_0.d_1 d_2 \ldots|} \times \beta^{-p}
\end{align}
In order to determine the maximum of this quantity, there is a need to find the maximum of the numerator and the minimum of the denominator. Since d_0\neq 0 (normalized system), the minimum value of the denominator is 1. The numerator is bounded above by (\beta-1).(\beta-1)(\beta-1)\ldots=\beta. Thus, \frac{|x-fl(x)|}{|x|} \leq \frac{\beta}{1} \times \beta^{-p} = \beta^{1-p}. Therefore, \epsilon_\text{mach}=\beta^{1-p} for round-by-chop. The proof for round-to-nearest is similar.
* Note that the first definition of machine epsilon is not quite equivalent to the second definition when using the round-to-nearest rule, but it is equivalent for round-by-chop.


Roundoff error caused by floating-point arithmetic

Numbers that can be represented exactly by floating-point numbers are called machine numbers. Even when the operands are machine numbers, performing floating-point arithmetic on them may lead to roundoff error in the final result.


Addition

Machine addition consists of lining up the radix points of the two numbers to be added, adding them, and then storing the result again as a floating-point number. The addition itself can be done in higher precision, but the result must be rounded back to the specified precision, which may lead to roundoff error.
* For example, adding 1 to 2^{-53} in IEEE double precision proceeds as follows:
\begin{align}
1.00\ldots 0 \times 2^{0} + 1.00\ldots 0 \times 2^{-53} &= 1.\underbrace{00\ldots 0}_{\text{52 bits}} \times 2^{0} + 0.\underbrace{00\ldots 0}_{\text{52 bits}}1 \times 2^{0}\\
&= 1.\underbrace{00\ldots 0}_{\text{52 bits}}1\times 2^{0}.
\end{align}
This is saved as 1.\underbrace{00\ldots 0}_{\text{52 bits}}\times 2^{0} since round-to-nearest is used in the IEEE standard. Therefore, 1+2^{-53} is equal to 1 in IEEE double precision and the roundoff error is 2^{-53}. This example shows that roundoff error can be introduced when adding a large number and a small number. The shifting of the radix points in the mantissas to make the exponents match causes the loss of some of the less significant digits. The loss of precision may be described as absorption. Note that the addition of two floating-point numbers can also produce a roundoff error when their sum is an order of magnitude greater than that of the larger of the two.
* For example, consider a normalized floating-point number system with base 10 and precision 2. Then fl(62)=6.2 \times 10^{1} and fl(41) = 4.1 \times 10^{1}. Note that 62+41=103 but fl(103)=1.0 \times 10^{2}. There is a roundoff error of 103-fl(103)=3. This kind of error can occur alongside an absorption error in a single operation.
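A minimal Python check of the absorption example above:

```python
# Adding a term at or below half the spacing at 1.0 leaves 1.0 unchanged.
print(1.0 + 2**-53 == 1.0)   # True: 2**-53 is half an ulp at 1.0, and the tie rounds back to 1.0
print(1.0 + 2**-52 == 1.0)   # False: 2**-52 is a full ulp at 1.0 and survives the addition
```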


Multiplication

In general, the product of two p-digit mantissas contains up to 2p digits, so the result might not fit in the mantissa, and roundoff error will be involved in the result.
* For example, consider a normalized floating-point number system with base \beta=10 and at most 2 mantissa digits. Then fl(77) = 7.7 \times 10^{1} and fl(88) = 8.8 \times 10^{1}. Note that 77 \times 88=6776 but fl(6776) = 6.7 \times 10^{3}, since at most 2 mantissa digits are kept. The roundoff error would be 6776 - fl(6776) = 6776 - 6.7 \times 10^{3}=76.
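A small Python sketch of the two-digit example (the helper chop_to_digits is illustrative, not a library function):

```python
import math

def chop_to_digits(x, p=2):
    e = math.floor(math.log10(abs(x)))   # decimal exponent of the leading digit
    step = 10 ** (e - p + 1)             # weight of the last kept digit
    return math.trunc(x / step) * step

product = 77 * 88                         # 6776, still exact in binary64
print(chop_to_digits(product))            # 6700
print(product - chop_to_digits(product))  # 76, the roundoff error
```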


Division

In general, the quotient of two p-digit mantissas may contain more than p digits, so roundoff error will be involved in the result.
* For example, if the normalized floating-point number system above is still being used, then 1/3=0.333 \ldots but fl(1/3)=fl(0.333 \ldots)=3.3 \times 10^{-1}. So, the tail 0.333 \ldots - 3.3 \times 10^{-1}=0.00333 \ldots is cut off.
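The discarded tail can be computed exactly with rational arithmetic; a minimal Python sketch:

```python
from fractions import Fraction

third = Fraction(1, 3)      # the exact quotient 1/3
kept = Fraction(33, 100)    # 3.3 × 10^-1, the two digits that fit in the mantissa
print(third - kept)         # 1/300, i.e. the cut-off tail 0.00333...
```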


Subtraction

Absorption also applies to subtraction.
* For example, subtracting 2^{-60} from 1 in IEEE double precision proceeds as follows:
\begin{align}
1.00\ldots 0 \times 2^{0} - 1.00\ldots 0 \times 2^{-60} &= \underbrace{1.00\ldots 0}_{\text{53 bits}} \times 2^{0} - \underbrace{0.00\ldots 01}_{\text{60 bits}} \times 2^{0}\\
&= \underbrace{0.11\ldots 1}_{\text{60 bits}}\times 2^{0}.
\end{align}
This is saved as \underbrace{1.00\ldots 0}_{\text{53 bits}}\times 2^{0} since round-to-nearest is used in the IEEE standard. Therefore, 1-2^{-60} is equal to 1 in IEEE double precision and the roundoff error is -2^{-60}.

The subtraction of two nearly equal numbers is called subtractive cancellation. When the leading digits are cancelled, the result may be too small to be represented exactly and it will just be represented as 0.
* For example, let |\epsilon| < \epsilon_\text{mach}, where the second definition of machine epsilon is used. What is the computed value of (1+\epsilon) - (1-\epsilon)? It is known that 1+\epsilon and 1-\epsilon are nearly equal numbers, and that exactly (1+\epsilon) - (1-\epsilon)=1+\epsilon-1+\epsilon=2\epsilon. However, in the floating-point number system, fl((1+\epsilon) - (1-\epsilon))=fl(1+\epsilon)-fl(1-\epsilon)=1-1=0. Although 2\epsilon is easily big enough to be represented, both instances of \epsilon have been rounded away, giving 0. Even with a somewhat larger \epsilon, the result is still significantly unreliable in typical cases. There is not much faith in the accuracy of the value because the most uncertainty in any floating-point number is in the digits on the far right.
* For example, 1.99999 \times 10^{2} - 1.99998 \times 10^{2} = 0.00001\times10^{2} =1 \times 10^{-5}\times 10^{2}=1\times10^{-3}. The result 1\times10^{-3} is clearly representable, but there is not much faith in it. This is closely related to the phenomenon of catastrophic cancellation, in which the two numbers are ''known'' to be approximations.
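A minimal Python sketch of subtractive cancellation near 1 (the value of \epsilon is chosen for illustration):

```python
eps = 1e-17                        # well below the rounding threshold near 1.0
print(1.0 + eps == 1.0)            # True: eps is absorbed
print(1.0 - eps == 1.0)            # True: eps is absorbed here as well
print((1.0 + eps) - (1.0 - eps))   # 0.0, although the exact answer is 2e-17
```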


Accumulation of roundoff error

Errors can be magnified or accumulated when a sequence of calculations is applied to an initial input that carries roundoff error due to inexact representation.


Unstable algorithms

An algorithm or numerical process is called stable if small changes in the input produce only small changes in the output, and unstable if large changes in the output are produced by small changes in the input. A sequence of calculations normally occurs when running an algorithm, and the amount of error in the result depends on the stability of the algorithm; roundoff error will be magnified by unstable algorithms.

For example, consider y_n = \int_0^1 \frac{x^{n}}{x+5} \, dx for n = 1, 2, \ldots, 8, with y_0 given. It is easy to show that y_n=\frac{1}{n}-5y_{n-1}. Suppose y_0 is the initial value and has a small representation error \epsilon, which means the initial input to this algorithm is y_0+\epsilon instead of y_0. Then the algorithm performs the following sequence of calculations:
\begin{align}
y_1 &= 1-5(y_0+\epsilon) = 1-5y_0-5\epsilon\\
y_2 &= \frac{1}{2}-5(1-5y_0-5\epsilon) = \frac{1}{2}-5+25y_0+5^{2}\epsilon\\
&\ \;\vdots\\
y_n &= \ldots + 5^{n}\epsilon
\end{align}
The roundoff error is amplified in succeeding calculations, so this algorithm is unstable; a sketch of the recurrence follows below.
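A minimal Python sketch of this recurrence, starting from the machine approximation of y_0 = \ln(6/5); the number of steps is chosen only to make the instability visible.

```python
import math

# The exact values satisfy 1/(6(n+1)) < y_n < 1/(5(n+1)), so they decrease slowly toward 0,
# but any error in y_{n-1} is multiplied by -5 at every step.
y = math.log(6 / 5)    # y_0, carrying a representation error on the order of 1e-17
for n in range(1, 31):
    y = 1 / n - 5 * y
    if n % 5 == 0:
        print(n, y)
# The later values fall far outside (1/(6(n+1)), 1/(5(n+1))) and blow up in magnitude.
```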


Ill-conditioned problems

Even if a stable algorithm is used, the solution to a problem may still be inaccurate due to the accumulation of roundoff error when the problem itself is ill-conditioned. The condition number of a problem is the ratio of the relative change in the solution to the relative change in the input. A problem is well-conditioned if small relative changes in input result in small relative changes in the solution. Otherwise, the problem is ill-conditioned. In other words, a problem is ill-conditioned if its condition number is "much larger" than 1. The condition number is introduced as a measure of the roundoff errors that can result when solving ill-conditioned problems.
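As an illustration, the defining ratio can be estimated numerically for the ill-conditioned problem of evaluating f(x) = x - 1 near x = 1; the input value and perturbation in the Python sketch below are arbitrary choices.

```python
x = 1.000001
dx = 1e-9                              # a small perturbation of the input

rel_change_in = dx / x
rel_change_out = abs(((x + dx) - 1.0) - (x - 1.0)) / abs(x - 1.0)

print(rel_change_out / rel_change_in)  # ≈ 1e6, in line with the condition number |x/(x-1)|
```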


See also

* Precision (arithmetic)
* Truncation
* Rounding
* Loss of significance
* Floating point
* Kahan summation algorithm
* Machine epsilon
* Wilkinson's polynomial



External links


* Roundoff Error at MathWorld.


* 20 Famous Software Disasters