Introduction to Numerical Data Types

Chapter 3 Introduction to Numerical Data Types

Objectives

Details about integers, including the limitations of integers in computers.
The subtypes of integers, especially signed and unsigned integers, as well as big versions.
Details of floating-point numbers on computers and how they differ from real numbers and decimal numbers.
Introduction to BigFloats, extended versions of floating-point numbers.
Introduction to complex and rational numbers.
Definitions of and differences between abstract and concrete datatypes and their hierarchy.
Converting between different types of numbers and parsing strings as numbers.
Rounding numbers to the nearest integer and returning the result as an integer or other type.

In Chapter 2, we saw a little bit about number and string data types. This chapter goes into greater detail about numbers and other datatypes, focusing on their limitations and on when to use each type of number.

Section 3.1 Integers

Recall that mathematically, an integer is a counting number \(1,2,3, \ldots\) along with 0 and the negative counting numbers \(-1,-2,-3,\ldots\text{.}\) Mathematically, there is no limit on the largest or smallest integer. However, if we store a number on a finite device such as a computer, there must be a limit on the smallest and largest integers that can be stored. This section determines those limits.

Subsection 3.1.1 Signed Integers

Let’s say we need to store an integer, say one that counts the number of appearances of a substring in a larger string. By default, there is a type Int that handles this perfectly. On a 64-bit computer this is a type alias for Int64, which is an integer that uses 64 bits for its storage.

This is the default on most current computers, which are 64-bit computers. It doesn’t really matter what this means in detail, but in short, 64-bit computers are optimized for handling numbers that are 64 bits wide. We’ll see 64-bit floating-point numbers in the next section.

Unless you need something specialized, Int64 is probably what you will use.

If you know you need an integer, it is important to know that there is a largest and a smallest value of this type. For the Int64 type, we can get these with typemin(Int64) and typemax(Int64). These return -9223372036854775808 and 9223372036854775807 respectively, and many scientific computing projects will fit within these values.
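As a quick check of the limits above, here is a minimal sketch. The choice of Int32 next to Int64 is only for illustration, and the comparison uses a BigInt so that the right-hand side cannot itself overflow.

```julia
println(typemin(Int64))                      # -9223372036854775808
println(typemax(Int64))                      # 9223372036854775807
println(typemax(Int32))                      # 2147483647
println(typemax(Int64) == big(2)^63 - 1)     # true
```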

Int64 is known as a signed integer because this type can store negative and positive numbers. The other signed integers are Int8, Int16, Int32 and Int128. The differences are the min and max values that can be stored. As discussed above, generally you can use Int64 for most cases. The smaller ones may be needed if total memory space is an issue and Int128 can be used if you need numbers outside the range of values of Int64.


We will look at details of the typing system in Chapter 21, and details of storing integers are in Appendix C.

Check Your Understanding 3.1.

Find the largest and smallest values of an Int8. Note: this will be important in the discussion in the next section.

Subsection 3.1.2 Overflow and Underflow of integer operations

Above, we discussed that there is a max and min value of a variable with an integer type. Even if two values fit within the given range, an operation on numbers of that type may produce a result that is too large to fit. This is known as overflow.

Here’s a simple example with 8-bit integers. Let x=Int8(95) and y=Int8(70). The sum of 95 and 70 is 165, which is above the maximum value for Int8. However, entering x+y results in -91, which is neither the expected result nor an error.

What just happened? If you want to know why the value of -91 arose, dig into the details in Appendix C. The reason there was no overflow error is that Julia does not automatically check for such errors, because checking adds overhead that slows down operations.
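The wraparound described above can be reproduced with a short sketch. The variable names wrapped and widened are illustrative only; widening to Int16 is one possible way to see the true sum.

```julia
x = Int8(95)
y = Int8(70)

wrapped = x + y                  # Int8 arithmetic wraps around silently
widened = Int16(x) + Int16(y)    # a wider type has room for the true sum

println(wrapped)                 # -91
println(widened)                 # 165
println(widened - 256)           # -91, the true sum minus 2^8
println(typeof(wrapped))         # Int8
```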

If you want to check, there is a suite of operations that will do so. For example, Base.checked_add(x,y) will result in

ERROR: OverflowError: 95 + 70 overflowed for type Int8

See Julia’s documentation on checked_add, which starts a list of functions that check for overflow and underflow. If there is any chance of overflow or underflow, then the results may be wrong. Keep this in mind, as we will write tests for code in Chapter 23.
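One way to use the checked operations is to wrap them in a small function. The sketch below assumes a hypothetical helper named safe_add (not part of Julia) that returns nothing when the sum would overflow.

```julia
# Hypothetical helper: returns the sum, or `nothing` if it would overflow.
function safe_add(x::T, y::T) where {T<:Integer}
    try
        return Base.checked_add(x, y)
    catch err
        err isa OverflowError || rethrow()
        return nothing
    end
end

println(safe_add(Int8(95), Int8(20)))   # 115
println(safe_add(Int8(95), Int8(70)))   # nothing
```

Whether to return nothing, throw, or fall back to a wider type is a design choice that depends on the application.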

Although no example is shown here, if an operation results in too small a value, the operation will underflow.

Subsection 3.1.3 Unsigned Integers

Next, we examine unsigned integers, which are typically thought of as the nonnegative integers (0,1,2,3, ...). For example, 8-bit unsigned integers have a total of \(2^8=256\) nonnegative numbers and since the smallest is 0, the largest is 255.

As shown above, the typemin and typemax methods will determine the smallest and largest values. If we try this on UInt8, though, we get 0x00 and 0xff respectively. A number that starts with 0x is in hexadecimal, and we can convert it to decimal with the Int function.

Yes, Int is both a type and a method. Calling most number types as a function will attempt to create a number of that type. For example, Int32(100) will make a 32-bit integer with the decimal number 100 as its value.

We can wrap Int around the typemin and typemax functions: Int(typemin(UInt8)) and Int(typemax(UInt8)) return 0 and 255. This is as expected because the total number of values an 8-bit number can store is \(2^8=256\text{,}\) and if 0 is the smallest then 255 is the largest.
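The same idea applies to the other unsigned types. The following sketch uses UInt16 as an illustrative choice and uses repr to get the hexadecimal form that the REPL displays.

```julia
hi = typemax(UInt16)

println(repr(hi))                # 0xffff
println(Int(hi))                 # 65535
println(Int(hi) == 2^16 - 1)     # true
println(typeof(0xff))            # UInt8, a two-digit hex literal is 8 bits
println(0xff == 255)             # true
```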

In Julia, the datatypes for unsigned integers are UInt8, UInt16, UInt32, UInt64 and UInt128. As with the smaller signed types discussed above, you typically don’t need these unless you have a specific reason.

Note 3.2.

Despite what was said above, UInt8 is commonly used in digital photographs. In an 8-bit image, either the single grayscale channel (for B/W images) or each of the standard red, green and blue channels uses 256 values. UInt8 fits this requirement exactly and is used to save space.

A typical photo may be a few thousand pixels in each direction. If a 64-bit integer were used for each channel of, say, a 4000 by 3000 pixel color image, the total number of bits would be

\begin{equation*} 4000 \cdot 3000 \cdot 64 \cdot 3 = 2.304 \times 10^9 \end{equation*}


bits or 288 million bytes (288 megabytes). Using 8-bit integers instead results in 36 megabytes. This is for an uncompressed image.

Also, images on more current devices use more than 256 levels per channel (often in high dynamic range photos), so UInt16 may be used at the pixel level.
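The storage arithmetic in this note can be checked directly. This is a minimal sketch; the variable names are illustrative, and sizeof gives the size of a type in bytes.

```julia
width, height, channels = 4000, 3000, 3
values_stored = width * height * channels

bits_int64  = values_stored * 64
bytes_int64 = values_stored * sizeof(Int64)
bytes_uint8 = values_stored * sizeof(UInt8)

println(bits_int64)     # 2304000000
println(bytes_int64)    # 288000000
println(bytes_uint8)    # 36000000
```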

Appendix C covers many of the details of representing integers in binary and performing basic operations. This chapter covers integer representation and operations only at a superficial level; those who want more depth should see Appendix C.

Section 3.2 Floating Point Numbers

Many fields in scientific computing rely on decimals, and the standard way to store these in a computer is with floating-point numbers. Details on floating-point numbers are in Appendix C. Julia has 16-, 32- and 64-bit floating-point numbers called Float16, Float32 and Float64, and the default is Float64.

Floating-point numbers are great for performing operations on decimals. However, they have four limitations:

There are minimum and maximum values for a given type. A rule of thumb is that Float16 has a maximum of about 65,000, Float32 has a maximum of about \(10^{38}\) and Float64 has a maximum of about \(10^{308}\text{.}\)

If you try to find these with typemin and typemax, however, such as typemin(Float32) and typemax(Float32), Julia will return -Inf32 and Inf32. These are the smallest and largest values of the Float32 type. The Inf part alludes to infinity and is used to handle overflow and underflow. If we try Float32(10^5)+Inf32, we get Inf32.

For floating-point numbers, the methods floatmin and floatmax return the smallest positive normalized value and the largest finite value. For example, floatmin(Float64) and floatmax(Float64) return 2.2250738585072014e-308 and 1.7976931348623157e308. (The e308 stands for \(\times 10^{308}\) and is explained later in this section.) Notice that the first is not a negative number, but the smallest positive normalized number that can be stored.

The next limitation of floating-point numbers is that of precision, that is, the number of digits that can be stored. Let’s look at the floating-point value of 1/3 in the three floating-point types; as you will see, each stores a different number of digits.

Float16(1/3) returns Float16(0.3333), Float32(1/3) returns 0.33333334f0 and Float64(1/3) returns 0.3333333333333333. This shows (for 1/3, but it is close for other numbers) that Float16 stores about 4 digits, Float32 about 7 digits and Float64 about 16 digits. These can vary a bit depending on the number being stored.

If you are using floating-point numbers, make sure that you have the precision that you want. As with integers, the default type (Float64) is generally good all around, but you may need more precision. (You, the astute reader, may be thinking that only three floating-point types were just mentioned and that Float64 has the highest precision of the three, so how can we get more? The answer is shown below: there is a type called BigFloat that can give any level of precision, within reason.)

Decimal numbers can rarely be stored exactly as floating-point numbers. Even though 0.4 looks like it is stored exactly (entering 0.4 just returns 0.4, and similarly for 0.2), the stored values aren’t exactly those numbers. In fact, if you add these by entering 0.2+0.4, the result is 0.6000000000000001, which is very close to, but not exactly, 0.6.

Generally this isn’t a problem, especially if we are only looking for a few digits of accuracy and there are other sources of error anyway. However, if you need decimals that act like mathematical decimals, there is a package called Decimals that can handle issues like this. Alternatively, since \(0.2 = \frac{1}{5}\) and \(0.4 = \frac{2}{5}\text{,}\) if we look at these as rational numbers, we can use the Rational type explained below in Section 3.5.

The fourth issue with floating-point numbers is that of round-off errors. This is related both to the precision of a floating-point type and to the inability to store decimals exactly. This is often a huge problem, and we will visit it with a specific example in Chapter 10.
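The limitations listed above can each be observed with a one-line check. The following is a minimal sketch; the use of eps and of the ≈ operator (isapprox) goes beyond what the text introduces and is included only as one possible way to probe precision and compare floats.

```julia
# 1. Range: typemax is infinite, floatmax is the largest finite value.
println(typemax(Float32) == Inf32)       # true
println(isinf(floatmax(Float64) * 2))    # true, the product overflows to Inf

# 2. Precision: eps is the gap between 1.0 and the next larger value.
println(eps(Float32) > eps(Float64))     # true, Float32 is coarser

# 3. Exactness: decimal fractions are usually not stored exactly.
println(0.2 + 0.4 == 0.6)                # false
println(0.2 + 0.4 ≈ 0.6)                 # true, allows a tiny tolerance

# 4. Exact alternative: rationals avoid the problem for fractions.
println(1//5 + 2//5 == 3//5)             # true
```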

Section 3.3 Extending integers, the `BigInt` type

In Chapter 9, we will explore prime numbers, and it is common for them to exceed the maximum allowable Int64 or even Int128. If this is needed, there is a type called BigInt with no maximum or minimum. Here’s the number one googol:

big(10)^100

which returns

10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000


This integer is 1 followed by 100 zeros. Note: the command big creates a BigInt, and generally normal operations with BigInts result in BigInts as well.

It’s very important to understand how big(10)^100 works. First a number of type BigInt is made with a value of 10. Then that is raised to the 100th power. As noted earlier, a BigInt doesn’t have an upper or lower limit on the number. It can grow as needed.

If instead we enter big(10^100), the result is 0, which is surprising. However, because of order of operations, 10^100 is first calculated in standard Int64 arithmetic and only then turned into a BigInt. Again, for details on what happens here, look at Appendix C. In short, repeatedly multiplying by ten overflows the Int64 type, and the wrapped result happens to be 0.

It is recommended to use a BigInt only if needed. Operations with them are significantly slower than with Int64 or even Int128. You rarely need them; however, we will point out in Chapter 9 when we might need to use them with prime numbers.
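The difference between the two ways of writing a googol can be seen side by side. In this sketch the names good and bad are illustrative only.

```julia
good = big(10)^100      # BigInt base, so the power is computed as a BigInt
bad  = big(10^100)      # 10^100 overflows in Int64 first, then is converted

println(typeof(good))            # BigInt
println(length(string(good)))    # 101, a 1 followed by 100 zeros
println(bad)                     # 0
println(typemax(Int64) + big(1) > typemax(Int64))   # true, no wraparound
```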

Section 3.4 Extending Floating Point Numbers with `BigFloat`

As we discussed above, a floating-point number has two limitations: 1) the number of digits stored and 2) the maximum exponent used. If we run into numbers that exceed either of these, we can turn to BigFloat. For example, what if you want to calculate \(\pi\) to 100 or more digits? See Chapter 48 for details on how to do this.

To get a floating-point number of type BigFloat, wrap the big function around a float. For example, x=big(0.25) returns 0.25, and we can verify its type with typeof(x), which returns BigFloat.

Let’s revisit an example from earlier and sum 1/9 nine times. If we try to turn this into a BigFloat with a=big(1/9), then the result is

0.111111111111111104943205418749130330979824066162109375


This seems to have an accuracy of only 17 digits, which is typical for a 64-bit floating-point number, so it doesn’t appear to have improved anything. This, like above, is a case of being careful in constructing a BigFloat. What happens with big(1/9)? Put on your order-of-operations hat and let’s take a look. The 1/9 is done first, and since both 1 and 9 are regular integers (Int64), the result is a Float64. Then the big function turns the Float64 into a BigFloat, but not with the accuracy expected.

Instead, if we define a=big(1)/big(9), then we get

0.1111111111111111111111111111111111111111111111111111111111111111111111111111109


which looks more like an expected result. To determine the number of digits of accuracy, you can count (painfully) or try length(string(a)), which returns 81 (this count includes the leading 0 and the decimal point). That is about 5 times the number of digits of a Float64. Technically, precision(a) returns 256, which is the number of bits in the significand. A Float64 has a 53-bit significand, so this is also nearly 5 times the binary precision.

Note: looking again at order of operations, the command length(string(a)) first takes the number a and returns it as a String. Then, working from the inside out, it finds the length of that string.
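The construction pitfall discussed in this section can be checked with comparisons rather than by reading digits. This is a minimal sketch that assumes the default BigFloat precision has not been changed; the names a_float and a_big are illustrative, and the string constructor BigFloat("0.1") is shown as one further way to avoid the Float64 step.

```julia
a_float = big(1/9)        # 1/9 is rounded to Float64 first, then widened
a_big   = big(1)/big(9)   # the division itself is carried out in BigFloat

println(a_float == 1/9)                 # true, no digits were gained
println(a_float == a_big)               # false
println(precision(a_big))               # 256 with the default setting
println(BigFloat("0.1") == big(0.1))    # false
```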

As noted at the beginning of this section, though, we might want to compute \(\pi\) to 1 million decimal digits (don’t peek yet, but, really, we will do this in Chapter 48), and what we’ve seen so far only has about 80 digits of accuracy. However, the BigFloat type is quite flexible. The above example used its default precision. We can change this with the setprecision function. For example:

setprecision(2^10)


returns 1024, showing that BigFloats will now be stored with this many bits of precision. Such a number uses 16 times the number of binary digits of a Float64. Entering a2=big(1)/big(9) returns

0.111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111


a number that, at a glance, looks to store about 4 times the number of digits as the default BigFloat size. To determine the total number of decimal digits, entering

length(string(a2))

returns 311 as the number of decimal digits. This is about 20 times the precision of Float64.
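Calling setprecision with a single argument changes the precision for all later BigFloats. As an alternative sketch, setprecision also accepts a do-block, which applies the new precision only inside the block.

```julia
# The do-block form changes the precision only inside the block.
a2 = setprecision(BigFloat, 1024) do
    big(1)/big(9)
end

println(precision(a2))    # 1024
println(a2 > 0.111)       # true
```

This form can be convenient when only one calculation needs the extra digits.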

Subsection 3.4.1 Limitations of `BigFloats`

Although a BigFloat stores more digits and allows a wider range of possible numbers, it still has limitations. Once you create a BigFloat, it keeps the precision that was set by setprecision at that time, and this cannot be changed afterward. So some thought about the level of precision that is needed should go in before calling setprecision.

Additionally, operations with BigFloats are significantly slower than with the Float64 type, so unless you need the extra precision, stay with the base type.

Section 3.5 Rational numbers

A rational number is the ratio of integers and more commonly called a fraction. Julia (unlike many other general computing languages) has rational numbers built-in. To put in a ratio, enter a // to separate the numerator and denominator. For example:

2//3
4//7
178//11
-1//2


are examples of rational numbers. One advantage that they have is that the numerator and denominator are stored as integers (64-bit by default) and are not subject to the round-off errors that floating-point numbers are. The standard operations \(+,-,\cdot,\div\) between rationals result in a rational, and as we will see in this course, there are advantages to using rationals instead of floating-point numbers.
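A few rational operations can be sketched as follows. The values are chosen only for illustration, and numerator and denominator are the standard accessor functions.

```julia
r = 6//8                            # stored in lowest terms
println(r)                          # 3//4
println(numerator(r))               # 3
println(denominator(r))             # 4
println(2//3 + 4//7)                # 26//21
println((2//3) * (4//7))            # 8//21
println(1//10 + 2//10 == 3//10)     # true
println(0.1 + 0.2 == 0.3)           # false
```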

Subsection 3.5.1 Exercise

Check Your Understanding 3.3.

Perform the following operations involving rationals in Julia:

(a) \(\dfrac{1}{2} + \dfrac{2}{3}\)

(b) \(\dfrac{1}{2} - \dfrac{2}{3}\)

(c) \(\dfrac{2}{3} \cdot \dfrac{3}{5}\)

(d) \(\dfrac{2}{3} \div \dfrac{3}{5}\)

Subsection 3.5.2 The Rational Type

If you enter typeof(1//2), note that Julia returns Rational{Int64}. This is called a parametric composite type, which will be discussed later. In this particular case, this is a rational type whose numerator and denominator are of type Int64. To make a different type of rational, you need to use a different integer type inside. For example, enter

Int16(1)//Int16(2)


and if you check the type of this, you will see it is of type Rational{Int16}.


Since operations with rationals do not round off as they do with floating-point numbers, you may want to consider using them. For example, entering

1//9+1//9+1//9+1//9+1//9+1//9+1//9+1//9+1//9


results in 1//1, which is actually the number 1 (written as a rational number).

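The type parameter and the exact sum above can be inspected with a short sketch. The use of fill and sum here is just one compact way to add nine copies of 1//9.

```julia
small = Int16(1)//Int16(2)
println(typeof(small))           # Rational{Int16}
println(typeof(Int8(1)//2))      # Rational{Int64}, mixed integer types are promoted

total = sum(fill(1//9, 9))       # nine copies of 1//9 added exactly
println(total)                   # 1//1
println(total == 1)              # true
```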

Check Your Understanding 3.4.

Build the rational \(\frac{1}{2}\) with BigInts within it. Check that in fact it is stored as you expect.


Subsection 3.5.3 Other Operations with Rationals

As long as you stay with the basic operations, \(+,-,\cdot, \div\text{,}\) the result will be a rational. However, the results of many other operations are not. For example, sin(1//2) will return a floating-point number. (2//3)^3 will return a rational, since this is ultimately multiplication, but (2//3)^(1//2) will return a floating-point number, since raising a number to the power \(1/2\) is the same as taking a square root.
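The result types described above can be confirmed with typeof. This is a minimal sketch; the last two lines, which mix a rational with a float and with an integer, are extra illustrations not taken from the text.

```julia
println(typeof(sin(1//2)))         # Float64
println((2//3)^3)                  # 8//27
println(typeof((2//3)^(1//2)))     # Float64
println(typeof(1//2 + 0.5))        # Float64
println(typeof(1//2 + 1))          # Rational{Int64}
```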

Subsection 3.5.4 Limitations of Rational Numbers

It may seem that any time you have fractions, you should simply use rational data types. However, they have limits. For example, let’s say we want to add the following:

1//1 + 1//2 + 1//3 + 1//4 + 1//5 + 1//6 + 1//7 + 1//8 + 1//9 + 1//10

which we can write with the sum function that we will see in Chapter 7 as

sum(i->1//i,1:10)

and the result is 7381//2520. This seems great, but if we sum the first 50 reciprocals with sum(i->1//i,1:50), then the result is an overflow error. Note that rationals check for overflow and underflow, unlike integers and floats, which do not check.
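The overflow can be caught and, in this case, avoided by choosing a wider integer type inside the rational. The sketch below uses Int128 as one possible choice; the names overflowed and h50 are illustrative.

```julia
println(sum(i -> 1//i, 1:10))        # 7381//2520

# With Int64 parts, the first 50 reciprocals do not fit.
overflowed = try
    sum(i -> 1//i, 1:50)
    false
catch err
    err isa OverflowError
end
println(overflowed)                  # true

# A wider integer type inside the rational leaves enough room here.
h50 = sum(i -> 1//Int128(i), 1:50)
println(typeof(h50))                 # Rational{Int128}
println(float(h50))
```

A wider type only postpones the problem, since the numerator and denominator keep growing as more terms are added.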

Section 3.6 Complex Numbers

Recall that the imaginary number \(i\) is defined as \(\sqrt{-1}\text{.}\) A complex number is a number of the form \(a+bi\) where \(a\) and \(b\) are real numbers. In Julia, there is a built-in constant im, which can be used to create complex numbers. For example, z=1+2im has type Complex{Int64}, which is a composite type in which the values \(a\) and \(b\) are of type Int64.
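A few basic operations on complex numbers are sketched below. The functions real, imag, conj and abs are standard, though they are not introduced in the text above.

```julia
z = 1 + 2im
println(typeof(z))        # Complex{Int64}
println(real(z))          # 1
println(imag(z))          # 2
println(conj(z))          # 1 - 2im
println(z * z)            # -3 + 4im
println(abs(3 + 4im))     # 5.0
println(im * im == -1)    # true
```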

Complex numbers play a huge role in many aspects of scientific computing, and in some cases, formulating a problem using complex numbers can make calculations faster and easier to program. This includes the very important fast Fourier transform algorithm, better known as the FFT. This and other examples are explained in Chapter 45.

Section 3.7 Converting numbers

float(x) converts a number to a floating-point number. For example,

float(1//3)

returns 0.3333333333333333.

More general is the convert method, which has the template convert(TYPE,value) and attempts to convert value to type TYPE. We can also convert the rational 1//3 to a float with

convert(Float64,1//3)

and we can use this to convert, say, a 64-bit floating-point number to a 16-bit floating-point number with:

convert(Float16,1/3)

which returns Float16(0.3333). Similarly, we can parse strings to different number types: parse(TYPE,str) parses str of type String into a number of type TYPE.

parse(Int,"1234")

returns the integer 1234 and

parse(Float64,"1234")

returns the floating-point number 1234.0.

If you have a number in another base, you can still parse it. For example, for the binary number 10011,

parse(Int,"10011",base=2)

results in 19.
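Conversions and parsing can fail, and it helps to see what failure looks like. The sketch below shows an inexact conversion and uses tryparse, which returns nothing instead of throwing; the name inexact is illustrative.

```julia
println(convert(Float64, 1//4))       # 0.25
println(convert(Int, 3.0))            # 3, the value is a whole number

# A conversion that would lose information throws an InexactError.
inexact = try
    convert(Int, 3.5)
    false
catch err
    err isa InexactError
end
println(inexact)                      # true

println(parse(Int, "ff", base=16))    # 255
println(tryparse(Int, "12a"))         # nothing
```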

Subsection 3.7.1 Rounding to Integers

If you have a floating-point number or rational and you want to convert it to an integer, you will typically use the ceil, floor or round functions. For example

round(Int,3.2)

returns the integer 3, using the standard rounding function that rounds to the nearest integer. If the decimal part is exactly 0.5, Julia by default rounds to the nearest even integer, so round(Int, 4.5) returns 4 and round(Int, 5.5) returns 6.

The function ceil rounds up to the nearest integer and floor rounds down. So ceil(Int, 3.2) returns 4 and floor(Int, 3.9) returns 3.

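The rounding functions for positive values can be compared side by side. In this sketch, the rounding mode RoundNearestTiesAway and the function trunc are additions beyond the text, shown as options when ties should go away from zero or when the fractional part should simply be dropped.

```julia
println(round(Int, 3.2))                          # 3
println(round(Int, 3.5))                          # 4
println(round(Int, 4.5))                          # 4, the tie goes to the even neighbor
println(round(Int, 4.5, RoundNearestTiesAway))    # 5
println(ceil(Int, 3.2))                           # 4
println(floor(Int, 3.9))                          # 3
println(trunc(Int, 3.9))                          # 3
```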

We saw above how floating-point numbers can be rounded to integers, but all of the examples used positive numbers. The following exercise explores how rounding works with negative floating-point numbers.

Check Your Understanding 3.5.

(a) Try rounding negative floating-point numbers that have non-integer parts using round, ceil and floor.

(b) Explain how rounding negative floating-point numbers works with round.

(c) Explain how rounding negative floating-point numbers works with ceil.

(d) Explain how rounding negative floating-point numbers works with floor.

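Hopefully you didn’t skip that last exercise. If you did it, you should have seen that round still returns the nearest integer, and that a tie again goes to the nearest even integer. For example, round(Int, -2.5) returns -2 and round(Int, -3.5) returns -4. Also, ceil always rounds toward positive infinity and floor toward negative infinity.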
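The behavior with negative numbers can be summarized in a short sketch. As before, RoundNearestTiesAway and trunc are additions beyond the text, included for comparison.

```julia
println(round(Int, -2.5))                          # -2
println(round(Int, -3.5))                          # -4
println(round(Int, -2.5, RoundNearestTiesAway))    # -3
println(ceil(Int, -3.2))                           # -3
println(floor(Int, -3.2))                          # -4
println(trunc(Int, -3.9))                          # -3
```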

Subsection 3.7.2 Rounding to Floats or other number types

Although rounding usually goes to integers, the first argument of round, ceil and floor is a data type, and if you want the result to be converted to a particular type, pass that type. So round(Float64, 3.2) returns 3.0.

The methods floor and ceil also work the same way. For example, floor(Rational{Int}, 4.5) returns the rational 4//1.
