Skip to main content

Appendix C Binary Represenation of Numbers

In Chapter 3, numeric data types were introduced. Some details of integers and floating points were given, however, some of the gorier details were left to here.

Section C.1 Integer Representation

We discussed the binary representation of non-negative numbers in Chapter 3. Negative numbers is more difficult. This method of data storage in called 2’s complement. A nice description of this is Ryan’s Tutorial on negative binary numbers.
Although seems a bit strange, let’s first discuss the conversion of binary to integer. Let the 8-bit binary number be written as \(x=sa_6a_5a_4a_3a_2a_1a_0\text{,}\) where the first bit is the sign and the other 7-bit compose the number then representation of \(x\) in decimal is
\begin{equation} x = -s \cdot 2^7 + a_6 2^6 + a_5 2^5 + a_4 2^4 + a_32^3 + a_2 2^2 + a_1 2^1 + a_0 2^0\tag{C.1} \end{equation}
As an example, note that if \(s=0\text{,}\) then this is identical to the integer representation of positive integer. Again, 00010010 would be
\begin{equation*} x = -0 \cdot 128 + 0 \cdot 64 + 0 \cdot 32 + 1 \cdot 16 + 0 \cdot 8 + 0 \cdot 4 + 1 \cdot 2 + 0 \cdot 1 = 18 \end{equation*}
Let’s look at an example with a negative number. Let x=11010111, then in decimal:
\begin{align*} x \amp = -1 \cdot 128 + 1 \cdot 64 + 0 \cdot 32 + 1 \cdot 16 + 0 \cdot 8 + 1 \cdot 4 + 1 \cdot 2 + 1 \cdot 1 \\\\ \amp = -128 + 64 + 16 + 4 +2 + 1 = -41 \end{align*}

Subsection C.1.1 Converting from decimal to binary

Again, like above, let’s concentrate only on 8-bit numbers. That is, we will take a decimal number x in the range \([-128,127]\) and return an array of length 8 of bits (you can even think of booleans) or we need to find \(s, a_0,a_1,\ldots,a_6\) from (C.1).
First, let’s examine only non-negative numbers, so \(s=0\text{.}\) And let’s use the number \(x=47\text{.}\) We’ll work from the larger digits to the smaller ones, so \(a_6\) to \(a_0\text{.}\)
  • Let x6 = 47
  • Let x5 = mod(x6,64) = 47 and a6 = div(x6,64) = 0.
  • Let x4 = mod(x5,32) = 15 and a5 = div(x5,32) = 1.
  • Let x3 = mod(x4,16) = 15 and a4 = div(x4,16) = 0.
  • Let x2 = mod(x3,8) = 7 and a3 = div(x3,8) = 1.
  • Let x1 = mod(x2,4) = 3 and a2 = div(x2,4) = 1.
  • Let x0 = mod(x1,2) = 1 and a1 = div(x1,2) = 1.
This shows that the result is 00101111. Again, bitstring(Int8(47)) returns "00101111" the same result.
A function that does this is as follows.
function dec2bin(n::Int)
  -128 <= n <=127 || throw(ArgumentError("The value of n must be between -128 and 127"))
  bdigits=zeros(Bool,8)
  ## this function calclates the digit of the decimal number k that goes in slot digit
  function calcDigit(k::Int,digit::Int)
    if digit==0
      bdigits[1] = k
    else
      bdigits[digit+1] = div(k,2^digit)
      calcDigit(mod(k,2^digit),digit-1)
    end
  end
  calcDigit(n,6)
  bdigits
end
Note that first this uses recursion as explained in Chapter 4. In addition, the error checking is explained in Chapter 11. This function returns a boolean array (ones and zeros). For example:
dec2bin(47)
returns [1 1 1 1 0 1 0 0]. And note that since the array returns the digits in the order from small to large, this result is the same as we showed by hand above and using the function bitstring.

Subsection C.1.2 Representing Negative Integers

To find the representation of a negative integer, we return to (C.1) and note that \(s=1\text{.}\) A little algebra results in
\begin{equation*} x+s \cdot 2^7 = a_6 2^6 + a_5 2^5 + a_4 2^4 + a_32^3 + a_2 2^2 + a_1 2^1 + a_0 2^0 \end{equation*}
and with \(s=1\text{,}\) we define the left side as \(y=x+127\) and now the right side is the same representation of a positive number and if \(-128 \leq x \leq -1\text{,}\) then \(0 \leq y \leq 127\text{.}\)
Let’s check out an example. Consider \(x=-67\text{,}\) then \(y=128+x=61\text{.}\) Using the algorithm to find the binary representation of 61 or dec2bin(61) which returns [1 0 1 1 1 1 0 0] recalling that the order is reversed from the function above.
Since we are looking for the decimal representation of -67, not 61, the sign bit is 1, so we have 10111101 and checking with the built-in function bitstring(Int8(-67)) or "10111101".
This method of storage of negative integers is called 2’s complement.

Check Your Understanding C.1.

Subsection C.1.3 Including negative Integers

If we extend the function above to include negative numbers, the following:
function dec2bin(n::Int)
  bdigits=zeros(Bool,8)
  ## this function calclates the digit of the decimal number k that goes in slot digit
  function calcDigit(k::Int,digit::Int)
    if digit==0
      bdigits[1] = k
    else
      bdigits[digit+1] = div(k,2^digit)
      calcDigit(mod(k,2^digit),digit-1)
    end
  end
  if n>=0
    calcDigit(n,6)
  else
    calcDigit(128+n,6)
    bdigits[8]=1
  end
  bdigits
end
Calling dec2bin(-57) results in [1 1 1 0 0 0 0 0] and this is the same as above. The output is a little tricky so we have another function that will take this boolean array to a string.
arr2str(arr::Array{Bool,1}) = reduce((x, y)->string(x,y ? "1" : "0"),reverse(arr),init="")
where the reduce function in Chapter 4 is used and the function reverse takes an array and reverses the elements. This is must more convenient with
				arr2str(dec2bin(-57))
				
returns "00000111", the same as above and the same as bitstring for 8-bit integers.

Check Your Understanding C.2.

Subsection C.1.4 A binary to decimal function

It’s also nice to have a function that reverses this.
## this function takes a 0-1 array (boolean array) and changes it to a decimal.
# the first element is the 2^0 digits, the 7th digit is the 2^6 digit and the 8th digit is the sign digit
# this uses the 2's complement representation
#
function bin2dec(arr::Array{Bool,1})
  sum=0
  for i=1:7
    sum += arr[i]*2^(i-1)
  end
  sum - arr[8]*128
end
To use this, we need to start with a binary array, so let’s start with
arr = map(x-> x==1 ? true : false, [1,1,1,0,0,0,1,1])
and then we can use the bin2dec function as
bin2dec(arr)
returns -57.
And the following is a function that will take a binary string and turn it to an array
str2arr(str::String) = reverse(map(x-> x=="1" ? true : false, split(str,"")))
If we use this in conjunction with bin2dec, we have some convenient functions. That is
bin2dec(str2arr("11000111"))
results in -57.

Section C.2 Operations on Binary Integers

Adding two positive binary integers is much like adding decimal integers. An example of \(54+19\) could be:
\begin{align*} & \phantom{+~}00110110\\ & \underline{+\phantom{~}00010011}\\ & \phantom{+~}01001001 \end{align*}
where the standard algorithm is used
The result of this in decimal is bin2dec(str2arr("01001001")) or 73.
Note that if we try to add 01001100 and 01100010 we get 10101110 (try it!) and if we use the conversion to get back to decimal results in bin2dec(str2arr("10101110")) or -82, which clearly is not the sum of two positive numbers. The reason this occurred is that the sum of the two numbers does not fit in an 8-bit integer, whose max value is 127. We can tell this directly from the binary result in that the first digit is a 1 (due to the sum of the two one’s in the second position). Recall that this first digit is a sign bit. This is an overflow error.

Subsection C.2.1 Unary minus

The unary minus operator negates a number. To see how to do this in binary, let’s look at what a bit flip looks like with a few examples. Let’s take the decimal number 54 or 00110110 and flip all of the bits (changes 0s to 1s and 1s to 0s) to get 11001001 and using bin2dec(str2arr("11001001")) which returns -55. If we use 10101010 which is \(-86\text{,}\) the bit flips are 01010101 which is \(85\text{.}\) Note that the sum of a binary number and it’s bit flips is 11111111 which is \(-1\text{.}\)
To determine the negation of a binary number, we will 1) flip the bits and 2) add one. So for example, 00101101 or 45. The bit flip of this is 11010010 and then add one to get 11010011, which is -45.

Check Your Understanding C.3.

(a)
Show that the unary minus of 0 in binary is still 0. (Note: this is one of the very nice features of using 2’s complement for storing integers.)
(b)
There is one 8-bit integer than does not have negative (that is an 8-bit integer). What is it?

Section C.3 Subtraction of binary integers

There are a couple of ways to think about subtraction. One can use the column algorithm like is standard for subtraction of decimal numbers and adapt to binary integers. However an easier way is to thing of \(a-b\) as \(a+(-b)\) and use the negative from the previous section. For example, look at \(78-85\text{.}\)
  1. The two integers in binary are "01001110"and "01010101".
  2. The negation of 85 is 10101010 + 00000001 or 10101011
  3. The sum of 01001110 and 10101011 is "00111001".
  4. converting back to decimal results in \(-7\text{.}\)

Section C.4 Multiplication of Binary Integers

\begin{align*} &\phantom{\times~}00001010 \\ &\underline{\times\phantom{~}00000110~} \\ &\phantom{\times~}00000000 \\ &\phantom{\times~}00010100 \\ &\underline{+\phantom{~}00101000~} \\ &\phantom{\times~}00111100 \end{align*}
which shows that 10 \(\times\) 6 equals 60.
Discuss multiplication of negative integers...

Check Your Understanding C.4.

Put all of these functions in a module as in Chapter \ref{ch:modules}. This will include

(a)

Creating a type called Integer8 which is an alias for a binary array of length 8.

(b)

Create a Base.show method for Integer8 to display as a string. See Chapter \ref{ch:modules} to do this and use the function in this chapter

(c)

Create a function parseBin, which takes in a string of length 8, and returns a value of type Integer8. Use the str2arr function from this chapter.

(d)

Create a unary minus, plus and minus on the type Integer8 types. Use the example of Polynomials in section \ref{sect:parametric-types} for examples on how to do this.

(e)

Create a test suite to thoroughly test all of the functions that you write.

Section C.5 Scientific Notation

Recall that any number written in decimal form with only a finite number of digits can be written in scientific notation that is in the form: \(a \times 10^{b}\)
where \(1<|a|<10\) and \(b\) is an integer. For example \(4003.23\) can be written as \(4.00323 \times 10^{3}\text{,}\) so \(a=4.00323\) and \(b=3\text{.}\)
In this form the number \(a\) is often called the \textbf{significand} or \textbf{mantissa} and the number \(b\) is \textbf{exponent}. This example has the base 10, however other bases are common (generally base 2).
One major advantage to using numbers in this form is the simple multiplication and division. Consider multiplying \(x=3.4 \times 10^{2}\) and \(y=-4.7 \times 10^{-3}\text{.}\) Using properties of exponentials we get
\begin{equation*} xy = (3.4)(-4.7) \times 10^{2-3} = -15.98 \times 10^{-1} \end{equation*}
and typically we would like to put this back into scientific notation by shifting the exponent so \(xy=-1.598 \times 10^{0}\text{.}\)
Division can be done in a similar manner and perhaps surprisingly, addition and subtraction are more difficult due to the fact that the exponents of the two numbers need to be equal before adding and subtracting.

Section C.6 Floating Point Numbers of a given size

The reason for using floating point numbers in calculations is twofold. First, there is a finite size of storage for a number and secondly, routines for performing operations on floating-point numbers are fast and usually encoded on a computer chip.
Consider a floating point of a given size, say 64 bits generally called a double precision floating point number. The first bit is generally used for the sign, the next 11 are the exponent and the final 52 bits store the mantissa. A floating point number has two limitations and that is the precision (how many digits that can be stored) and the magnitude (the largest number). Double precision numbers are store in binary and converted to decimal with the form:
\begin{equation*} (-1)^{s} 2^{c-1023}(1+f) \end{equation*}
where \(s\) is the sign \(c\) is the exponent and \(f\) stores the mantissa. For example, consider the following number:
\begin{equation*} 0\;10000000101\;0111011010000000000000000000000000000000000000000000 \end{equation*}
where spaces separate out \(s\text{,}\) \(c\) and \(f\text{.}\) Converting \(c\) to decimal:
\begin{equation*} c = 1 \cdot 2^{10} + 0 \cdot 2^{9} + \cdots + 1 \cdot 2^{2} + 0 \cdot 2^{1} + 1 \cdot 2^{0} = 1029 \end{equation*}
The mantissa is calculated in the following way
\begin{align*} f & = 0 \cdot \biggl(\dfrac{1}{2}\biggr)^{1} + 1 \cdot \biggl(\dfrac{1}{2}\biggr)^{2} + 1 \cdot \biggl(\dfrac{1}{2}\biggr)^{3} + 1 \cdot \biggl(\dfrac{1}{2}\biggr)^{4} + 0 \cdot \biggl(\dfrac{1}{2}\biggr)^{5} + 1 \cdot \biggl(\dfrac{1}{2}\biggr)^{6} \\ & \qquad + 1 \cdot \biggl(\dfrac{1}{2}\biggr)^{7} + 0 \cdot \biggl(\dfrac{1}{2}\biggr)^{8} + 1 \cdot \biggl(\dfrac{1}{2}\biggr)^{9} = \frac{237}{512} \end{align*}
and thus the floating point number is:
\begin{equation*} (-1)^{0} 2^{1029-1023} \left(1+\dfrac{237}{512}\right) = 93.625 \end{equation*}
The double precision number system falls into a class of number systems that we can commonly call floating-point number systems.