
Section 2.4 Floating Point Number Systems

Another important subject needed for numerical analysis is floating point numbers. In mathematics, we work with numbers such as integers, rationals, reals and complex numbers. These are great for doing mathematics, but when we try to use them for computation they are not so great. For example, how do you store the number \(x=\sqrt{2}\text{?}\) In a computer, you have to limit the number of stored digits of a decimal number. Also, how do you store the number of atoms in the universe? It's an integer, but it's enormous.
If we look at the integers, we could store an arbitrary integer simply by recording its decimal digits. This can be done, but it is not very practical, and for most situations arbitrary-size integers are not needed. Therefore most computer software packages place upper and lower limits on the integers.
We can do something similar with the rationals, since any rational is a ratio of two integers; we will skip the details of this discussion.
Storing a real number is also a problem, and most computer software uses a floating point number as an approximation. Before discussing the details, we will review writing a number in scientific notation.

Subsection 2.4.1 Scientific Notation

Recall that any number written in decimal form with only a finite number of digits can be written in scientific notation, that is, in the form:
\begin{equation*} \begin{aligned}a \times 10^{b}\end{aligned} \end{equation*}
where \(1\leq |a|<10\) and \(b\) is an integer. For example, \(4003.23\) can be written as \(4.00323 \times 10^{3}\text{,}\) so \(a=4.00323\) and \(b=3\text{.}\)
In this form, the number \(a\) is often called the significand or mantissa and the number \(b\) is the exponent. This example uses base 10, but other bases (most commonly base 2) are also used.
One major advantage to using numbers in this form is the simple multiplication and division. Consider multiplying \(x=3.4 \times 10^{2}\) and \(y=-4.7 \times 10^{-3}\text{.}\) Using properties of exponentials we get
\begin{equation*} \begin{aligned}xy \amp = (3.4)(-4.7) \times 10^{2-3}= -15.98 \times 10^{-1}\end{aligned} \end{equation*}
and typically we would like to put this back into scientific notation by shifting the exponent so \(xy=-1.598 \times 10^{0}\text{.}\)
Division can be done in a similar manner. Perhaps surprisingly, addition and subtraction are more difficult, because the exponents of the two numbers must be made equal before the significands can be added or subtracted.
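The multiplication rule above can be sketched in a few lines of Python; the `normalize` and `multiply` helpers are our own names for illustration, not standard functions.

```python
# A sketch of arithmetic on numbers stored as (significand, exponent) pairs.

def normalize(a, b):
    """Shift (a, b) so that 1 <= |a| < 10, adjusting the exponent."""
    while abs(a) >= 10:
        a /= 10
        b += 1
    while 0 < abs(a) < 1:
        a *= 10
        b -= 1
    return a, b

def multiply(x, y):
    """Multiply two (significand, exponent) pairs and renormalize."""
    a1, b1 = x
    a2, b2 = y
    return normalize(a1 * a2, b1 + b2)

# x = 3.4 x 10^2 and y = -4.7 x 10^-3, as in the text.
a, b = multiply((3.4, 2), (-4.7, -3))
print(a, b)  # approximately -1.598 and 0
```

Note that `multiply` never needs to compare exponents; addition would first have to shift one significand until the exponents match.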

Subsection 2.4.2 Floating Point Numbers of a given size

The reason for using floating point numbers in calculations is twofold. First, a number occupies a fixed, finite amount of storage; second, routines for performing operations on floating point numbers are fast and usually implemented in hardware.
Consider a floating point number of a given size, say 64 bits, generally called a double precision floating point number. The first bit is used for the sign, the next 11 bits for the exponent, and the final 52 bits for the mantissa. A floating point number has two limitations: its precision (how many digits can be stored) and its magnitude (the largest number that can be stored). Double precision numbers are stored in binary and converted to decimal with the form:
\begin{equation*} \begin{aligned}(-1)^{s} 2^{c-1023}(1+f)\end{aligned} \end{equation*}
where \(s\) is the sign bit, \(c\) is the exponent, and \(f\) stores the mantissa. For example, consider the following number:
\begin{equation*} \begin{aligned}0\;10000000101\;0111011010000000000000000000000000000000000000000000\end{aligned} \end{equation*}
where spaces separate out \(s\text{,}\) \(c\) and \(f\text{.}\) Converting \(c\) to decimal:
\begin{equation*} \begin{aligned}c \amp = 1 \cdot 2^{10}+ 0 \cdot 2^{9} + \cdots + 1 \cdot 2^{2} + 0 \cdot 2^{1} + 1 \cdot 2^{0} = 1029\end{aligned} \end{equation*}
The mantissa is calculated in the following way
\begin{equation*} \begin{aligned}f \amp = 0 \cdot \biggl(\frac{1}{2}\biggr)^{1} + 1 \cdot \biggl(\frac{1}{2}\biggr)^{2} + 1 \cdot \biggl(\frac{1}{2}\biggr)^{3} + 1 \cdot \biggl(\frac{1}{2}\biggr)^{4} + 0 \cdot \biggl(\frac{1}{2}\biggr)^{5} + 1 \cdot \biggl(\frac{1}{2}\biggr)^{6} \\ \amp \qquad + 1 \cdot \biggl(\frac{1}{2}\biggr)^{7} + 0 \cdot \biggl(\frac{1}{2}\biggr)^{8} + 1 \cdot \biggl(\frac{1}{2}\biggr)^{9} = \frac{237}{512}\end{aligned} \end{equation*}
and thus the floating point number is:
\begin{equation*} \begin{aligned}(-1)^{0} 2^{1029-1023}\biggl(1+\frac{237}{512}\biggr) = 93.625\end{aligned} \end{equation*}
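This decoding can be checked in code. The sketch below follows the formula above by hand and then cross-checks the result against Python's own IEEE 754 doubles via the `struct` module.

```python
# Decode the 64-bit pattern from the text: sign, biased exponent, mantissa.
import struct

bits = ("0" "10000000101"
        "0111011010000000000000000000000000000000000000000000")

s = int(bits[0], 2)         # sign bit
c = int(bits[1:12], 2)      # 11-bit exponent, biased by 1023
f = sum(int(d) * 2.0**-(i + 1) for i, d in enumerate(bits[12:]))  # mantissa

value = (-1)**s * 2.0**(c - 1023) * (1 + f)
print(c, f, value)  # 1029 0.462890625 93.625

# Cross-check: pack the same 64 bits and let the hardware interpret them.
(check,) = struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))
assert check == value == 93.625
```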
The double precision number system is one example of a class of number systems commonly called floating point number systems:

Definition 2.4.1. Floating Point Number System.

A floating point number system, \(\mathbf{F}(\beta, k, m, M)\) is a subset of the real numbers characterized by the parameters:
Table 2.4.2. Parameters of a floating point number system
\(\beta:\) the base
\(k\text{:}\) the number of digits in the mantissa
\(m\text{:}\) the minimum exponent
\(M\text{:}\) the maximum exponent
Elements of \(\mathbf{F}\) have the form:
\begin{equation*} \begin{aligned}\pm(0.d_{1}d_{2}\ldots d_{k}) \times \beta^{e}\end{aligned} \end{equation*}
where \(d_{1}\) is nonzero (except when representing the number 0), \(d_{i} \in \{0,1,2, \ldots, \beta-1\}\text{,}\) and \(m\leq e \leq M\text{.}\)
The double precision number system listed above is \(\mathbf{F}(2,52,-1023,1024)\text{.}\)

Example 2.4.3.

Consider the floating point system \(\mathbf{F}(10,2,0,2)\text{,}\) which is very simple but illustrative. The following numbers are the elements of this set.
\begin{equation*} \begin{aligned}0, \amp \pm 0.10, \pm 0.11, \pm 0.12, \ldots, \pm 0.19 \\ \amp \pm 0.20, \pm 0.21, \pm 0.22 \dots, \pm 0.29 \\ \amp \vdots \\ \amp \pm 0.90, \pm 0.91, \pm 0.92 \ldots, \pm 0.99 \\ \amp \pm 1.0, \pm 1.1, \pm 1.2, \ldots, \pm 1.9 \\ \amp \pm 2.0, \pm 2.1, \pm 2.2, \ldots, \pm 2.9 \\ \amp \vdots \\ \amp \pm 9.0, \pm 9.1, \pm 9.2, \ldots, \pm 9.9 \\ \amp \pm 10, \pm 11, \pm 12, \ldots, \pm 19 \\ \amp \pm 20, \pm 21, \pm 22, \ldots, \pm 29 \\ \amp \vdots \\ \amp \pm 90, \pm 91, \pm 92, \ldots, \pm 99\end{aligned} \end{equation*}
There are a total of 541 numbers in this system: 270 positive numbers, their 270 negatives, and 0.
There are a few things to notice about a floating point number system: the set of numbers is finite, and the numbers are not evenly distributed.
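Both observations can be checked by enumerating the system. The sketch below uses exact `Fraction` arithmetic so that no floating point effects interfere with the count or the spacing.

```python
# Enumerate F(10, 2, 0, 2) exactly and examine its size and spacing.
from fractions import Fraction

elements = {Fraction(0)}
for d1 in range(1, 10):          # leading digit is nonzero
    for d2 in range(10):
        for e in range(3):       # exponents 0, 1, 2
            m = Fraction(10 * d1 + d2, 100)   # the mantissa 0.d1d2
            elements |= {m * 10**e, -m * 10**e}

print(len(elements))  # 541

# The spacing is uneven: neighbors differ by 0.01 near 0.10 but by 1 near 99.
xs = sorted(x for x in elements if x > 0)
gaps = [b - a for a, b in zip(xs, xs[1:])]
print(min(gaps), max(gaps))  # 1/100 1
```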

Subsection 2.4.3 Rounding Floating Point Numbers

As we saw above, one issue with floating point numbers is their limited precision and magnitude; a more significant issue is rounding error when performing operations with such numbers.
Consider the floating point system we saw above, \(\textbf{F}(10,2,0,2)\text{.}\) If we take two of the numbers in the system, \(x=2.1\) and \(y=0.26\text{,}\) and add them, we get \(x+y=2.36\text{,}\) which cannot be represented in the system. In this case there are two options typically available: rounding and chopping.
In rounding, if \(x\) cannot be represented in the system, the nearest number in the system to \(x\) is used. In the example above, 2.4 would be used.
In chopping, if \(x\) cannot be represented in the system, the number found by dropping the extra digits is used. In the example above, 2.3 would be used.
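Both behaviors can be imitated with Python's `decimal` module, which lets us set the working precision to 2 significant digits; a minimal sketch of the example above:

```python
# Rounding versus chopping, emulated at 2 significant digits with decimal.
from decimal import Decimal, localcontext, ROUND_HALF_UP, ROUND_DOWN

x, y = Decimal("2.1"), Decimal("0.26")

with localcontext() as ctx:
    ctx.prec = 2                  # 2 significant digits, as in F(10, 2, 0, 2)
    ctx.rounding = ROUND_HALF_UP  # use the nearest representable number
    rounded = x + y

with localcontext() as ctx:
    ctx.prec = 2
    ctx.rounding = ROUND_DOWN     # drop the extra digits
    chopped = x + y

print(rounded, chopped)  # 2.4 2.3
```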

Subsection 2.4.4 Overflow and Underflow

If an operation results in a nonzero number between 0 and the number in the system closest to 0, this is called underflow. For example, if we divide \(0.20\) by \(50\) in the system used above, the result is 0.004, which cannot be represented, so an underflow occurs.
If an operation results in a number larger in magnitude than the largest number in the system, this is called overflow. For example, if we multiply 44 by 50 in the system above, the result is 2200, which is larger than 99 (the largest number in the system), so overflow occurs.
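The same two phenomena occur in double precision, where the limits play the role of 99 and 0.10 in the toy system; a small sketch:

```python
# Overflow and underflow in IEEE 754 double precision.
import math
import sys

print(sys.float_info.max)        # largest finite double, about 1.8e308

big = sys.float_info.max * 10    # overflow: the result becomes infinity
print(big, math.isinf(big))

tiny = 5e-324                    # smallest positive (subnormal) double
print(tiny / 2)                  # underflow: the result rounds to 0.0
```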

Subsection 2.4.5 Rounding Error

Regardless of the number of digits used in the floating point system, rounding or chopping occurs with nearly every operation. The difference between the actual result and the stored number is called the rounding error, whether the numbers are rounded or chopped.

Definition 2.4.4.

Let \(p^{\star}\) be an approximation to the number \(p\text{.}\) The absolute error is
\begin{equation*} |p^{\star}- p| \end{equation*}
and the relative error is
\begin{equation*} \frac{|p^{\star}-p|}{|p|} \end{equation*}

Example 2.4.5.

Let \(p=\sqrt{3}\) and \(p^{\star}=1.73\text{.}\) Find the absolute and relative errors.
The absolute error is:
\begin{equation*} \begin{aligned}|p^{\star}-p| \amp = |1.73-\sqrt{3}| = 0.00205\end{aligned} \end{equation*}
and the relative error is
\begin{equation*} \begin{aligned}\frac{|p^{\star}-p|}{|p|} \amp = \frac{|1.73-\sqrt{3}|}{|\sqrt{3}|}= \frac{0.00205}{\sqrt{3}}= 0.00118\end{aligned} \end{equation*}
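A short sketch of the two error measures applied to this example (the function names are our own):

```python
# Absolute and relative error for p = sqrt(3), p* = 1.73.
import math

def absolute_error(p_star, p):
    return abs(p_star - p)

def relative_error(p_star, p):
    return abs(p_star - p) / abs(p)

p, p_star = math.sqrt(3), 1.73
print(absolute_error(p_star, p))  # about 0.00205
print(relative_error(p_star, p))  # about 0.00118
```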

Subsection 2.4.6 Practical Effects of Floating-Point Arithmetic

We saw a simple example of a floating-point system above, and although it was quite simple, the floating-point systems that most computers employ have similar limitations. We will see these effects in a simple example using the quadratic formula, as well as in solving a linear system.

Subsubsection 2.4.6.1 Cancellation Error

One of the biggest sources of error arises when subtracting two nearly equal numbers. The best example comes from a familiar place.
Consider the following example. Solve \(0.27x^{2} - 99.4x + 4.2 =0\) using the quadratic formula.
In this case, to emphasize the effects of limited precision, let's use 4 digits of precision to compute:
\begin{equation*} \begin{aligned}x \amp = \frac{-b \pm \sqrt{b^{2} -4ac}}{2a}\\ \amp = \frac{99.4 \pm \sqrt{9880.36 - 4.536}}{0.54} \amp \amp \text{round to 4 digits}\\ \amp \approx \frac{99.4 \pm \sqrt{9880 - 4.536}}{0.54}\\ \amp = \frac{99.4 \pm \sqrt{9875.464}}{0.54} \amp \amp \text{round to 4 digits}\\ \amp \approx \frac{99.4 \pm \sqrt{9875}}{0.54}\\ \amp = \frac{99.4 \pm 99.373}{0.54} \amp \amp \text{round to 4 digits}\\ \amp \approx \frac{99.40 \pm 99.37}{0.54}\\ \amp = \frac{0.03}{0.54}, \frac{198.8}{0.54} \amp \amp \text{round to 4 digits}\\ \amp = 0.05555555, 368.148 \\ \amp \approx 0.05555, 368.1\end{aligned} \end{equation*}
When solving this with no roundoff, the roots are \(368.1058898\) and \(0.0422584\text{.}\) The larger of the two is correct to 4 digits, but the smaller is far off: its relative error is about 31%.
What's going on here? The issue is that the two terms in the numerator, \(-b\) and \(\sqrt{b^{2}-4ac}\text{,}\) are nearly equal, so the subtraction cancels most of the significant digits.
One way to solve this problem is to rewrite the root in which \(\sqrt{b^{2}-4ac}\) is subtracted from \(-b\text{,}\) that is
\begin{equation*} \begin{aligned}x_{2} \amp = \frac{-b-\sqrt{b^{2}-4ac}}{2a}\end{aligned} \end{equation*}
If we rationalize the numerator, we can write:
\begin{equation*} \begin{aligned}x_{2} \amp = \frac{(-b-\sqrt{b^{2}-4ac})(-b+\sqrt{b^{2}-4ac})}{2a(-b+\sqrt{b^2-4ac})}\\ \amp = \frac{4ac}{2a(-b+\sqrt{b^2-4ac})}\\ \amp = \frac{2c}{-b+\sqrt{b^{2}-4ac}}\end{aligned} \end{equation*}
an analogous form of the quadratic formula in which the nearly equal quantities are added instead of subtracted. If we use \(a=0.27\text{,}\) \(b=-99.4\) and \(c=4.2\) with 4 digits of precision, we get 0.04225, which agrees with the exact root \(0.0422584\) to within a relative error of about 0.02%.
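The same cancellation is easy to reproduce in full double precision. The sketch below uses a made-up polynomial, \(x^2 - 10^8 x + 1 = 0\text{,}\) whose smaller root equals \(10^{-8}\) to essentially full precision:

```python
# Cancellation in the quadratic formula versus the rationalized form.
import math

a, b, c = 1.0, -1e8, 1.0          # hypothetical example polynomial
d = math.sqrt(b * b - 4 * a * c)

naive = (-b - d) / (2 * a)        # subtracts two nearly equal numbers
stable = (2 * c) / (-b + d)       # rationalized form: adds them instead

print(abs(naive - 1e-8) / 1e-8)   # large relative error from cancellation
print(abs(stable - 1e-8) / 1e-8)  # near machine precision
```

Even with 52 bits of mantissa, the naive formula loses most of its accuracy here, while the rationalized form does not.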

Subsubsection 2.4.6.2 Matrices and Roundoff Errors

One of the major places that roundoff errors occur is in matrix operations. Consider for example performing Gaussian Elimination on the following matrix:
\begin{equation*} \left[\begin{array}{rrr|r} 6 \amp 1 \amp 5 \amp 13 \\ 2 \amp 1/3 \amp -9 \amp -17 \\ 1 \amp 4 \amp 2 \amp 34/5 \end{array}\right] \end{equation*}
If we use rational numbers to solve this problem, then we get for the first step:
\begin{equation*} \begin{array}{r}-\frac{1}{3} R_1 + R_2 \rightarrow R_2 \\ -\frac{1}{6} R_1 + R_3 \rightarrow R_3\end{array} \qquad \left[\begin{array}{rrr|r} 6 \amp 1 \amp 5 \amp 13 \\ 0 \amp 0 \amp -32/3 \amp -64/3 \\ 0 \amp 23/6 \amp 7/6 \amp 139/30 \end{array}\right] \end{equation*}
From the second row, \(-\frac{32}{3} z = -\frac{64}{3}\text{,}\) so \(z=2\text{;}\) the other two rows and back substitution give \(x=2/5\) and \(y=3/5\text{.}\)
If instead we use floating point arithmetic with 4 significant digits of accuracy, then the original matrix is:
\begin{equation*} \left[\begin{array}{rrr|r} 6.0 \amp 1.0 \amp 5.0 \amp 13.0 \\ 2.0 \amp 0.3333 \amp -9.0 \amp -17.0 \\ 1.0 \amp 4.0 \amp 2.0 \amp 6.8 \end{array}\right] \end{equation*}
and performing the same two operations, you get:
\begin{equation*} \begin{array}{r}-\frac{1}{3} R_1 + R_2 \rightarrow R_2 \\ -\frac{1}{6} R_1 + R_3 \rightarrow R_3\end{array} \qquad \left[\begin{array}{rrr|r} 6.0 \amp 1.0 \amp 5.0 \amp 13.0 \\ 0.0 \amp -3.333 \times 10^{-5} \amp -10.66 \amp -21.33 \\ 0.0 \amp 3.833 \amp 1.166 \amp 4.633\end{array}\right] \end{equation*}
Lastly, eliminating the entry in the 3rd row, 2nd column, we get:
\begin{equation*} \begin{array}{r}1.15 \times 10^{5} R_2 +R_3 \rightarrow R_3\end{array} \qquad \left[\begin{array}{rrr|r} 6.0 \amp 1.0 \amp 5.0 \amp 13.0 \\ 0.0 \amp -3.333 \times 10^{-5} \amp -10.66 \amp -21.33 \\ 0.0 \amp 0 \amp -1226000 \amp -2453000\end{array} \right] \end{equation*}
Now, using back substitution, we can get that \(z=2.001\text{.}\) Using the second row, we get that
\begin{equation*} y = \frac{-21.33 + 10.66(2.001)}{-3.333 \times 10^{-5}}= 0.0001 \end{equation*}
and then
\begin{equation*} x = \frac{13.0 - 0.0001 - 5(2.001)}{6} = 0.499 \end{equation*}
Compared with the results using rational numbers (\(x=0.4\text{,}\) \(y=0.6\text{,}\) \(z=2\)), the computed values of \(x\) and \(y\) are off by a tremendous amount. We will see in Chapter 7 how to perform matrix row operations in a way that minimizes such errors.
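The rational-arithmetic version can be verified with Python's `Fraction` type. The sketch below performs the elimination exactly; note that a zero pivot appears in the second column in exact arithmetic, so a row swap is needed before continuing.

```python
# Solve the augmented system exactly with rational arithmetic.
from fractions import Fraction as F

A = [[F(6), F(1),    F(5),  F(13)],
     [F(2), F(1, 3), F(-9), F(-17)],
     [F(1), F(4),    F(2),  F(34, 5)]]

n = 3
# Forward elimination, swapping rows when a zero pivot appears.
for k in range(n):
    if A[k][k] == 0:
        for i in range(k + 1, n):
            if A[i][k] != 0:
                A[k], A[i] = A[i], A[k]
                break
    for i in range(k + 1, n):
        m = A[i][k] / A[k][k]
        A[i] = [a - m * b for a, b in zip(A[i], A[k])]

# Back substitution.
x = [F(0)] * n
for i in reversed(range(n)):
    s = sum(A[i][j] * x[j] for j in range(i + 1, n))
    x[i] = (A[i][n] - s) / A[i][i]

print(x)  # [Fraction(2, 5), Fraction(3, 5), Fraction(2, 1)]
```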

Subsection 2.4.7 Why not just use Rational Numbers?

With the results of the previous problem in mind, one apparent solution is to just use rational numbers. In many cases this is a great idea; in general, however, the computations become costly as the size of the matrix increases, because the integers in the numerators and denominators grow rapidly. There are also limitations on the storage of rational numbers, so they may not be practical.
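A small sketch of that growth: summing just the first 30 terms of the harmonic series exactly already produces a fraction whose denominator has more than a dozen digits.

```python
# Exact rational arithmetic: the integers grow quickly.
from fractions import Fraction

s = sum(Fraction(1, k) for k in range(1, 31))  # 1/1 + 1/2 + ... + 1/30
print(len(str(s.numerator)), "digit numerator,",
      len(str(s.denominator)), "digit denominator")
```

Every subsequent operation on such a fraction must manipulate these large integers, which is why exact rational arithmetic becomes expensive for large matrices.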