How to approximate a real number by a machine number

Machine numbers

Normalized binary machine numbers are of the form

± ( . 1 d2 d3 . . . dn )2 · 2e
where e is an integer with emineemax .

In addition there are special values such as zero, ±Inf (Infinity), NaN (Not a number).

The function fl giving the closest machine number

For a given real number x (different from zero) we want to find the closest machine number, denoted by fl(x) .
  1. Write x in the form
    x = ± q · 2e
    with 1/2 ≤ q < 1 and integer e.
  2. If e > emax let fl(x) = ± Inf (Overflow) . Stop.
    If e < emin let fl(x) = 0 (Underflow) . Stop.
  3. Find the binary representation q = ( . 1 c2 c3 c4. . . )2 of the mantissa (see algorithm below)
  4. Round the mantissa to n digits: If q is less than the borderline number
    ( . 1 c2 c3 . . . cn 1 0 0 0 . . .)2 we round down, otherwise we round up:

    If cn+1 = 0 use ( . 1 c2 c3 . . . cn )2
    If cn+1 = 1 use ( . 1 c2 c3 . . . cn )2 + 2-n
    for the mantissa ( . 1 d2 d3 . . . dn )2 .

    In the borderline case we may either round up or down; IEEE 754 specifies to pick the number which has dn=0.

    If ( . 1 c2 c3 . . . cn )2 = ( . 1 1 1 . . . 1 )2 and we round up:

    Let e := e + 1, ( . 1 d2 d3 . . . dn )2 = ( . 1 0 0 0 . . . 0 )2.
    If e > emax let fl(x) = ± Inf (Overflow)

The machine epsilon

The distance between the two mantissa candidates (for rounding up or down) is 2-n. Since we pick the closer of the two, the absolute error in the mantissa is at most 2-n-1. Since q ≥ 1/2 we have a relative error in the mantissa of at most 2-n. We obtain:

If xmin ≤ |x|xmax, the roundoff error | (fl(x) - x)/ x | satisfies

| (fl(x) - x)/ x | ≤ epsilonM = 2 -n
where epsilonM = 2 -n is called the machine epsilon or machine accuracy.

Example

IEEE 754 single precision machine numbers (float in C, real in Fortran) have n = 24 digits in the mantissa and use emin = -125, emax = 128.

We want to find fl(x) for x = 1/10 = 0.1 .

  1. 0.1 = 0.2 · 2-1 = 0.4 · 2-2 = 0.8 · 2-3. Hence we have q = 0.8, e = 3.
  2. We have -125 ≤ 3 ≤ 128, so there is no overflow or underflow.
  3. By using the algorithm below we find
    0.8 = ( . 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 . . . )2
  4. The first 24 digits are ( . 1 1 0 0 1 1 0 0 . . . 1 1 0 0 )2, and the 25th digit is c25 = 1. Hence we round up, and the rounded mantissa is
    ( . 1 d2 d3 . . . d24 )2 = ( . 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 )2

Algorithm for finding the binary digits of a number 0 ≤ q < 1

A real number q with 0 ≤ q < 1 has a binary representation
q = ( . c1 c2 c3 . . . )2 = c1 · 2-1 + c2 · 2-2 + c3 · 2-2 + . . .
with infinitely many digits c1, c2, c3, c4, . . . which can be found by the following algorithm:
For k=1 to infinity:
      q := 2 q
      ck := floor(q)
      q := q - floor(q)