In Chapter 3, numeric data types were introduced. Some details of integers and floating points were given, however, some of the gorier details were left to here.
We discussed the binary representation of non-negative numbers in Chapter 3. Negative numbers is more difficult. This method of data storage in called 2’s complement. A nice description of this is Ryan’s Tutorial on negative binary numbers.
Although seems a bit strange, let’s first discuss the conversion of binary to integer. Let the 8-bit binary number be written as \(x=sa_6a_5a_4a_3a_2a_1a_0\text{,}\) where the first bit is the sign and the other 7-bit compose the number then representation of \(x\) in decimal is
Again, like above, let’s concentrate only on 8-bit numbers. That is, we will take a decimal number x in the range \([-128,127]\) and return an array of length 8 of bits (you can even think of booleans) or we need to find \(s, a_0,a_1,\ldots,a_6\) from (C.1).
First, let’s examine only non-negative numbers, so \(s=0\text{.}\) And let’s use the number \(x=47\text{.}\) We’ll work from the larger digits to the smaller ones, so \(a_6\) to \(a_0\text{.}\)
function dec2bin(n::Int)
-128 <= n <=127 || throw(ArgumentError("The value of n must be between -128 and 127"))
bdigits=zeros(Bool,8)
## this function calclates the digit of the decimal number k that goes in slot digit
function calcDigit(k::Int,digit::Int)
if digit==0
bdigits[1] = k
else
bdigits[digit+1] = div(k,2^digit)
calcDigit(mod(k,2^digit),digit-1)
end
end
calcDigit(n,6)
bdigits
end
Note that first this uses recursion as explained in Chapter 4. In addition, the error checking is explained in Chapter 11. This function returns a boolean array (ones and zeros). For example:
returns [1 1 1 1 0 1 0 0]. And note that since the array returns the digits in the order from small to large, this result is the same as we showed by hand above and using the function bitstring.
and with \(s=1\text{,}\) we define the left side as \(y=x+127\) and now the right side is the same representation of a positive number and if \(-128 \leq x \leq -1\text{,}\) then \(0 \leq y \leq 127\text{.}\)
Let’s check out an example. Consider \(x=-67\text{,}\) then \(y=128+x=61\text{.}\) Using the algorithm to find the binary representation of 61 or dec2bin(61) which returns [1 0 1 1 1 1 0 0] recalling that the order is reversed from the function above.
Since we are looking for the decimal representation of -67, not 61, the sign bit is 1, so we have 10111101 and checking with the built-in function bitstring(Int8(-67)) or "10111101".
function dec2bin(n::Int)
bdigits=zeros(Bool,8)
## this function calclates the digit of the decimal number k that goes in slot digit
function calcDigit(k::Int,digit::Int)
if digit==0
bdigits[1] = k
else
bdigits[digit+1] = div(k,2^digit)
calcDigit(mod(k,2^digit),digit-1)
end
end
if n>=0
calcDigit(n,6)
else
calcDigit(128+n,6)
bdigits[8]=1
end
bdigits
end
Calling dec2bin(-57) results in [1 1 1 0 0 0 0 0] and this is the same as above. The output is a little tricky so we have another function that will take this boolean array to a string.
## this function takes a 0-1 array (boolean array) and changes it to a decimal.
# the first element is the 2^0 digits, the 7th digit is the 2^6 digit and the 8th digit is the sign digit
# this uses the 2's complement representation
#
function bin2dec(arr::Array{Bool,1})
sum=0
for i=1:7
sum += arr[i]*2^(i-1)
end
sum - arr[8]*128
end
To use this, we need to start with a binary array, so let’s start with
Note that if we try to add 01001100 and 01100010 we get 10101110 (try it!) and if we use the conversion to get back to decimal results in bin2dec(str2arr("10101110")) or -82, which clearly is not the sum of two positive numbers. The reason this occurred is that the sum of the two numbers does not fit in an 8-bit integer, whose max value is 127. We can tell this directly from the binary result in that the first digit is a 1 (due to the sum of the two one’s in the second position). Recall that this first digit is a sign bit. This is an overflow error.
The unary minus operator negates a number. To see how to do this in binary, let’s look at what a bit flip looks like with a few examples. Let’s take the decimal number 54 or 00110110 and flip all of the bits (changes 0s to 1s and 1s to 0s) to get 11001001 and using bin2dec(str2arr("11001001")) which returns -55. If we use 10101010 which is \(-86\text{,}\) the bit flips are 01010101 which is \(85\text{.}\) Note that the sum of a binary number and it’s bit flips is 11111111 which is \(-1\text{.}\)
To determine the negation of a binary number, we will 1) flip the bits and 2) add one. So for example, 00101101 or 45. The bit flip of this is 11010010 and then add one to get 11010011, which is -45.
There are a couple of ways to think about subtraction. One can use the column algorithm like is standard for subtraction of decimal numbers and adapt to binary integers. However an easier way is to thing of \(a-b\) as \(a+(-b)\) and use the negative from the previous section. For example, look at \(78-85\text{.}\)
Create a unary minus, plus and minus on the type Integer8 types. Use the example of Polynomials in section \ref{sect:parametric-types} for examples on how to do this.
Recall that any number written in decimal form with only a finite number of digits can be written in scientific notation that is in the form: \(a \times 10^{b}\)
where \(1<|a|<10\) and \(b\) is an integer. For example \(4003.23\) can be written as \(4.00323 \times 10^{3}\text{,}\) so \(a=4.00323\) and \(b=3\text{.}\)
In this form the number \(a\) is often called the \textbf{significand} or \textbf{mantissa} and the number \(b\) is \textbf{exponent}. This example has the base 10, however other bases are common (generally base 2).
One major advantage to using numbers in this form is the simple multiplication and division. Consider multiplying \(x=3.4 \times 10^{2}\) and \(y=-4.7 \times 10^{-3}\text{.}\) Using properties of exponentials we get
Division can be done in a similar manner and perhaps surprisingly, addition and subtraction are more difficult due to the fact that the exponents of the two numbers need to be equal before adding and subtracting.
The reason for using floating point numbers in calculations is twofold. First, there is a finite size of storage for a number and secondly, routines for performing operations on floating-point numbers are fast and usually encoded on a computer chip.
Consider a floating point of a given size, say 64 bits generally called a double precision floating point number. The first bit is generally used for the sign, the next 11 are the exponent and the final 52 bits store the mantissa. A floating point number has two limitations and that is the precision (how many digits that can be stored) and the magnitude (the largest number). Double precision numbers are store in binary and converted to decimal with the form: