How to approximate a real number by a machine number
Machine numbers
Normalized binary machine numbers are of the form
± ( . 1 d2 d3 . . .
dn )2 · 2e
where e is an integer with emin ≤ e ≤
emax .
In addition there are special values such as zero, ±Inf
(Infinity), NaN (Not a number).
The function fl giving the closest machine number
For a given real number x (different from zero) we want to find the
closest machine number, denoted by fl(x) .
- Write x in the form
x = ± q · 2e
with 1/2 ≤ q < 1 and integer e.
- If e > emax let fl(x) =
± Inf (Overflow) . Stop.
If e < emin let fl(x) = 0
(Underflow) . Stop.
- Find the binary representation q = ( . 1 c2
c3 c4. . . )2 of the
mantissa (see algorithm below)
- Round the mantissa to n digits: If q is less than the
borderline number
( . 1 c2 c3 . . .
cn 1 0 0 0 . . .)2 we round down, otherwise
we round up:
If cn+1 = 0 use ( . 1
c2 c3 . . . cn
)2
If cn+1 = 1 use ( . 1
c2 c3 . . . cn
)2 + 2-n
for the mantissa ( . 1 d2 d3 . . .
dn )2 .
In the borderline case we may either round up or down; IEEE 754
specifies to pick the number which has dn=0.
If ( . 1 c2 c3 . . .
cn )2 = ( . 1 1 1 . . . 1 )2
and we round up:
Let e := e + 1, ( . 1 d2
d3 . . . dn )2 = ( .
1 0 0 0 . . . 0 )2.
If e > emax let fl(x) = ± Inf
(Overflow)
The machine epsilon
The distance between the two mantissa candidates (for rounding up or down) is
2-n. Since we pick the closer of the two, the absolute error
in the mantissa is at most 2-n-1. Since q ≥ 1/2
we have a relative error in the mantissa of at most 2-n. We
obtain:
If xmin ≤ |x|
≤ xmax, the roundoff error |
(fl(x) - x)/ x | satisfies
| (fl(x) - x)/ x | ≤
M = 2 -n
where
M = 2
-n is called the machine epsilon or machine
accuracy.
Example
IEEE 754 single precision machine numbers (float in C,
real in Fortran) have n = 24 digits in the mantissa and use
emin = -125, emax = 128.
We want to find fl(x) for x = 1/10 = 0.1 .
- 0.1 = 0.2 · 2-1 = 0.4 · 2-2 = 0.8 ·
2-3. Hence we have q = 0.8, e = 3.
- We have -125 ≤ 3 ≤ 128, so there is no overflow or
underflow.
- By using the algorithm below we find
0.8 = ( . 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 . . .
)2
- The first 24 digits are ( . 1 1 0 0 1 1 0 0 . . . 1 1 0 0 )2,
and the 25th digit is c25 = 1. Hence we round up, and
the rounded mantissa is
( . 1 d2 d3 . . .
d24 )2 = ( . 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
0 1 1 0 0 1 1 0 1 )2
Algorithm for finding the binary digits of a number 0 ≤
q < 1
A real number q with 0 ≤ q < 1 has a binary
representation
q = ( . c1 c2
c3 . . . )2 = c1 ·
2-1 + c2 · 2-2 +
c3 · 2-2 + . . .
with infinitely many digits c1, c2,
c3, c4, . . . which can be found by
the following algorithm:
For k=1 to infinity:
q := 2 q
ck := floor(q)
q := q - floor(q)