How to approximate a real number by a machine number

Machine numbers

Normalized binary machine numbers are of the form

± ( . 1 d₂ d₃ . . . d_n )₂ · 2^e

where e is an integer with e_min ≤ e ≤ e_max .

In addition there are special values such as zero, ±Inf (Infinity), NaN (Not a number).

The function fl giving the closest machine number

For a given real number x (different from zero) we want to find the closest machine number, denoted by fl(x) .

Write x in the form
x = ± q · 2^e
with 1/2 ≤ q < 1 and integer e.
If e > e_max let fl(x) = ± Inf (Overflow) . Stop.
If e < e_min let fl(x) = 0 (Underflow) . Stop.
Find the binary representation q = ( . 1 c₂ c₃ c₄. . . )₂ of the mantissa (see algorithm below)
Round the mantissa to n digits: If q is less than the borderline number
( . 1 c₂ c₃ . . . c_n 1 0 0 0 . . .)₂ we round down, otherwise we round up:
If c_n+1 = 0 use ( . 1 c₂ c₃ . . . c_n )₂
If c_n+1 = 1 use ( . 1 c₂ c₃ . . . c_n )₂ + 2^-n
for the mantissa ( . 1 d₂ d₃ . . . d_n )₂ .

In the borderline case we may either round up or down; IEEE 754 specifies to pick the number which has d_n=0.

If ( . 1 c₂ c₃ . . . c_n )₂ = ( . 1 1 1 . . . 1 )₂ and we round up:

Let e := e + 1, ( . 1 d₂ d₃ . . . d_n )₂ = ( . 1 0 0 0 . . . 0 )₂.
If e > e_max let fl(x) = ± Inf (Overflow)

The machine epsilon

The distance between the two mantissa candidates (for rounding up or down) is 2^-n. Since we pick the closer of the two, the absolute error in the mantissa is at most 2^-n-1. Since q ≥ 1/2 we have a relative error in the mantissa of at most 2^-n. We obtain:

If x_min ≤ |x| ≤ x_max, the roundoff error | (fl(x) - x)/ x | satisfies

| (fl(x) - x)/ x | ≤ _M = 2 ^-n

where

_M = 2 ^-n is called the machine epsilon or machine accuracy.

Example

IEEE 754 single precision machine numbers (float in C, real in Fortran) have n = 24 digits in the mantissa and use e_min = -125, e_max = 128.

We want to find fl(x) for x = 1/10 = 0.1 .

0.1 = 0.2 · 2^-1 = 0.4 · 2^-2 = 0.8 · 2^-3. Hence we have q = 0.8, e = 3.
We have -125 ≤ 3 ≤ 128, so there is no overflow or underflow.
By using the algorithm below we find
0.8 = ( . 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 . . . )₂
The first 24 digits are ( . 1 1 0 0 1 1 0 0 . . . 1 1 0 0 )₂, and the 25th digit is c₂₅ = 1. Hence we round up, and the rounded mantissa is
( . 1 d₂ d₃ . . . d₂₄ )₂ = ( . 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 )₂

Algorithm for finding the binary digits of a number 0 ≤ q < 1

A real number q with 0 ≤ q < 1 has a binary representation

q = ( . c₁ c₂ c₃ . . . )₂ = c₁ · 2^-1 + c₂ · 2^-2 + c₃ · 2^-2 + . . .

with infinitely many digits c₁, c₂, c₃, c₄, . . . which can be found by the following algorithm:

For k=1 to infinity:
      q := 2 q
      c_k := floor(q)
      q := q - floor(q)