
Chapter 31 Introduction to Data Analysis

Data is ubiquitous these days, and being able to analyze it is crucial. In this chapter, we will first load a dataset that is available from a Julia package, plot the data, and do some basic analysis with DataFrames. Then data will be loaded from a file. In addition, a package will be used to string together operations on DataFrames.

Section 31.1 Data from the RDatasets package

The statistical software/language called R has many built-in and loadable datasets. A Julia package called RDatasets makes many of these datasets available. Ensure that the package is added, then load it with using RDatasets.
The datasets are subdivided into packages, and you can see the entire list of the packages with
RDatasets.packages()
The top of this output looks like:
34×2 DataFrame                       9 rows omitted
  Row  Package   Title
       String15  String
  1    COUNT     Functions, data and code for count data.
  2    Ecdat     Data sets for econometrics
  3    HSAUR     A Handbook of Statistical Analyses Using R (1st Edition)
  4    HistData  Data sets from the history of statistics and data visualization
Each of these is a package, and a listing of all of the datasets in a given package can be found with
RDatasets.datasets("datasets")
as an example. The top of this looks like:
50×5 DataFrame                                        25 rows omitted
 Row Package   Dataset          Title                                   Rows  Columns
     String15  String31         String                                 Int64  Int64
  1  datasets  BOD              Biochemical Oxygen Demand                   6  2
  2  datasets  CO2              Carbon Dioxide Uptake in Grass Plants      84  5
  3  datasets  Formaldehyde     Determination of Formaldehyde               6  2
  4  datasets  HairEyeColor     Hair and Eye Color of Statistics Students  32  4
  5  datasets  InsectSprays     Effectiveness of Insect Sprays             72  2
  6  datasets  LifeCycleSavings Intercountry Life-Cycle Savings Data       50  6
  7  datasets  Loblolly         Growth of Loblolly pine trees              84  4
Each row in this table is a dataset with the given number of rows and columns. We can load a dataset with the dataset function. It is not exported from the module, so it must be called with the full path. For example, we can load the classic iris dataset to view:
iris=RDatasets.dataset("datasets","iris")
The top of the dataset looks like:
150×5 DataFrame                                 125 rows omitted
  Row  SepalLength  SepalWidth	PetalLength   PetalWidth  Species
       Float64      Float64	    Float64     Float64     Cat…
  1    5.1          3.5	        1.4         0.2         setosa
  2    4.9          3.0	        1.4         0.2         setosa
  3    4.7          3.2	        1.3         0.2         setosa
  4    4.6          3.1	        1.5         0.2         setosa
  5    5.0          3.6	        1.4         0.2         setosa
  6    5.4          3.9	        1.7         0.4         setosa
The first line of the output of the dataset shows the size (number of rows and then number of columns). The second line is the column names, the third line is the datatype for each column. The first column is row number (not part of the dataset). The last column is a categorical datatype and more will be said about this later.

Section 31.2 Dataframes

In some ways, a DataFrame is like a 2D array in terms of accessing its parts, but in many ways it is very different. Notice that the columns of the iris DataFrame have different types, but each column has a fixed data type. This data type is always shown at the top of the column.
This is very different from a 2D array, where the entire array must have a single type. Julia does allow an array of Any, into which anything can go, but anytime Julia doesn’t know a data type, it slows down calculations tremendously. A better way to think of a DataFrame is as a collection of Vectors (1D arrays), where each array (column) has a name.
We can access the DataFrame using some of the syntax from arrays. For example, iris[2,3] will return the element in the 2nd row and 3rd column. Many of the array accessing methods from Chapter 6 also work.
There are other nice features, though, since typically we want to do things with columns. For example, iris[!,:PetalLength] returns a 1D array (of Float64). The ! in the first slot means to use all rows. If we want the 11th through the 20th rows of the SepalWidth column, enter
iris[11:20,:SepalWidth]
which returns
10-element Vector{Float64}:
  3.7
  3.4
  3.0
  3.0
  4.0
  4.4
  3.9
  3.5
  3.8
  3.8
A perhaps-confusing aspect of using a DataFrame in a Jupyter notebook is that there is a limit on the number of columns and rows shown. A fixed number of rows (25) is shown at the top and bottom of the DataFrame. To change this (and often I think it shows too many rows), set the DATAFRAMES_ROWS variable in the following way:
ENV["DATAFRAMES_ROWS"] = 10
You may also see at the top of the output for a DataFrame a note that some of the columns are hidden. The default number of columns shown is 100, and in most environments you can generally scroll left and right to see all of the columns. To change the number of columns, set DATAFRAMES_COLUMNS, as above. (I haven’t had a dataset with more than this number of columns.)

Subsection 31.2.1 Constructing a DataFrame

Although you will nearly always use a DataFrame by loading in a dataset, a DataFrame can also be created by specifying the columns. Consider:
data = DataFrame(
  A = 1:2:13,
  B = ["M", "F", "F", "M", "X", "F", "M"],
  C = [3.0, 2.5, pi, -2.3, 1/3, 56, 100],
  D = [(-1)^n//n for n = 1:7]
)
which will create a DataFrame with column names A, B, C and D, where Julia infers the data type of each column. Note: this can all be put on a single line, but for readability the 4 columns are placed on separate lines. The result of this is
7×4 DataFrame
  Row  A     B      C        D
       Int64 String Float64  Rational…
  1    1     M       3.0      -1//1
  2    3     F       2.5       1//2
  3    5     F       3.14159  -1//3
  4    7     M      -2.3       1//4
  5    9     X       0.333333 -1//5
  6    11    F       56.0      1//6
  7    13    M       100.0     -1//7
There are a number of helpful functions associated with DataFrames:
  • size(df) similar to an array, returns the number of rows and columns.
  • names(df) returns the names of the columns as an array.
  • describe(df) gives a summary of basic statistics for each column of the DataFrame.
  • first(df,n) and last(df,n) return the top \(n\) and bottom \(n\) rows of the DataFrame, respectively.
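As a quick illustration of these functions, here is a small, made-up DataFrame (not one of the datasets above):

```julia
using DataFrames

# a tiny DataFrame for illustration only
df = DataFrame(x = 1:3, y = ["a", "b", "c"])

size(df)      # the tuple (3, 2): 3 rows, 2 columns
names(df)     # the vector ["x", "y"]
first(df, 2)  # a new DataFrame containing the top 2 rows
```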
Check Your Understanding 31.1.
Try running the above commands on the dataframe defined as data.

Subsection 31.2.2 Basic Statistics

As mentioned above, we can find summary statistics of a DataFrame with the describe function. For example, describe(iris) returns
5×7 DataFrame
  Row  variable     mean     min      median      max    nmissing  eltype
       Symbol       Union…   Any      Union…      Any    Int64     DataType
  1    SepalLength  5.84333  4.3      5.8         7.9    0         Float64
  2    SepalWidth   3.05733  2.0      3.0         4.4    0         Float64
  3    PetalLength  3.758    1.0      4.35        6.9    0         Float64
  4    PetalWidth   1.19933  0.1      1.3         2.5    0         Float64
  5    Species               setosa         virginica    0         CategoricalValue{String, UInt8}
Note that the output of describe is itself a DataFrame. For numerical data, the mean, min, median and max are clear, but for categorical data only the min and max are listed, which are the first and last categories in sorted order.
You can also call functions like mean or std directly if you have loaded the Statistics package (which you may have to add) with the command using Statistics. Now we can find the mean and standard deviation of a column with
mean(iris[!,:SepalLength]), std(iris[!,:SepalLength])
which computes both and returns (5.843333333333334, 0.828066127977863).
The Statistics package also has functions to compute the variance, weighted means, the median and quantiles. The Julia documentation for this package has more information.
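For example, with a plain vector (the same syntax works on any numerical DataFrame column):

```julia
using Statistics

v = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]

var(v)             # sample variance: 5.6
median(v)          # 4.0
quantile(v, 0.25)  # first quartile: 2.25
```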
For another example that we will also use below, consider this DataFrame:
simpsons = DataFrame(
  id = 1:2:13,
  name = ["Homer", "Marge", "Lisa", "Bart", "Maggie", "Apu", "Moe"],
  age = [45, 42, 8, 10, 1, 38, 59],
  salary = [50000, 25000, 10000, missing, missing, 45000, 3000],
  favorite_food = ["pork chops", "casserole", "salad", "hamburger", missing, "saag paneer", "peanuts"]
)
If we enter describe(simpsons), we get the following results:
5×7 DataFrame
Row  variable        mean     min       median   max    nmissing  eltype
     Symbol          Union…   Any       Union…   Any    Int64     Type
1    id              7.0      1         7.0      13     0         Int64
2    name                     Apu                Moe    0         String
3    age             29.0     1         38.0     59     0         Int64
4    salary          26600.0  3000      25000.0  50000  2         Union{Missing, Int64}
5    favorite_food            casserole          salad  1         Union{Missing, String}
There are some new things that occur here. First of all, this DataFrame has missing data that we added explicitly. As a result, the nmissing column, which reports the number of missing elements in each column, is nonzero for the salary and favorite_food columns. Also, the eltype (short for element type) is a Union type, meaning that elements in the column are either of type Missing (explained in Chapter 33) or of another type (either Int64 or String).
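Note that calling mean directly on a column with missing values returns missing, since missing propagates through arithmetic; the skipmissing function drops the missing entries first. Using the salary values from above:

```julia
using Statistics

salary = [50000, 25000, 10000, missing, missing, 45000, 3000]

mean(salary)               # returns missing, since missing propagates
mean(skipmissing(salary))  # 26600.0, matching the describe output above
```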

Section 31.3 The pipe command

Although this isn’t exclusive to data analysis, piping commands is common there, so we address it here. First, let’s look at a non-data example.
Consider a complicated nested example, like:
sqrt(sin(big(2.0)))
which returns 0.9535708819095105398718705441330078869495314128723191646104050462287315327622005.
Under standard parsing rules, we work from the inside out, which can get complicated. Often it’s easier to start with the value 2.0 and apply each of the functions in turn. In many computing environments, including Julia, this is possible with the pipe operator |>, which rewrites the order of operations. The above example can be written:
2.0  |> big  |> sin  |> sqrt
and this returns the same result.
This expression can be thought of as starting with 2.0, then piping to the big function, then to the sin function, then to the sqrt function. This is similar to Reverse Polish (postfix) notation and, once understood, it is quite helpful in tackling complex tasks.
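Here is another small example of the same idea:

```julia
# Start with 16, take the square root, then convert to an integer.
16 |> sqrt |> Int   # 4

# Equivalent nested form:
Int(sqrt(16))       # 4
```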
Let’s look at an example with a vector. We will find the mean of the numbers 1, 2, 3, 5, 6, 7. This can be done with mean([1,2,3,5,6,7]), but let’s look at another way.
  1. Start with the range 1:7.
  2. Expand it to the vector 1, 2, 3, 4, 5, 6, 7.
  3. Remove the 4th element.
  4. Take the mean.
Following these steps, this can be done with:
1:7 |>
  collect |>
  arr -> deleteat!(arr, 4) |>
  mean
resulting in 4.0. This form is often clearer in that each step is on a separate line and does one separate thing.

Subsection 31.3.1 Using the Chain package

The Chain package cleans up some of these steps. First, make sure that the package is added and then loaded with using Chain. Overall, a number of steps are applied to an initial piece of data. Above we saw this with an array; in the next chapter we will see it with a DataFrame. The general structure is
@chain data begin
  # functions applied to data in order
end
and the functions that are applied are written on separate lines without the |>. For example, the removal of an item from an array can be written:
@chain 1:7 begin
  collect
  deleteat!(4)
  mean
end
Hopefully, you can see that this syntax is cleaner and in many ways easier to read. Remember that using @chain results in the previous line always being applied to the function on the current line, so the collect line is really collect(1:7). What happens with the third line? Since deleteat! requires two arguments, @chain automatically puts the result from above into the first slot of the function. We’ll see below what to do if we need something different.
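To make the first-slot insertion concrete, the @chain block above is equivalent to these ordinary nested calls (using only Base and the Statistics standard library):

```julia
using Statistics

# collect 1:7 into a vector, delete its 4th element, then take the mean
mean(deleteat!(collect(1:7), 4))   # 4.0
```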
For another example, let’s say we want to do the following. Start with the vector [1,2,3,4,5,6,7,8,9,10], then
  1. square each element
  2. keep all even numbers
  3. find the standard deviation.
Basically each of these steps will be on a separate line within a @chain block.
@chain 1:10 begin
  map(x -> x^2, _)
  filter(x -> x % 2 == 0, _)
  std
end
A few things to notice here. First, instead of starting with the integer vector of the numbers between 1 and 10, we use the UnitRange 1:10. This is just for simplicity. Second, you’ll notice that there is an underscore character _ in the second and third lines of the block. This means that the result of the previous line is put where the _ is. Since the array (or UnitRange) needs to be in the second slot of the map function, we need to use the underscore.
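As a check, the same computation written as ordinary nested calls (what the @chain block does for us) is:

```julia
using Statistics

# square each element, keep the even results, then take the standard deviation
std(filter(x -> x % 2 == 0, map(x -> x^2, 1:10)))
```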

Section 31.4 Loading Data from a CSV File

As we will see in Chapter 38 and Chapter 39, data will often be located in a file instead of typed in directly. In this section, we will look at how to load data from a CSV (comma-separated values) file, the most common format for data, in which each row represents a single record and the columns represent variables.
We will use the CSV package to load files. Recall that you will need to add and load it (with using CSV) before being able to use it.
Consider a simple CSV file. Download simple.csv and move it into the same directory as your ipynb file. We will load it with
simple = CSV.read("simple.csv", DataFrame)
and the result is
6×5 DataFrame
  Row  Name      Age    Sex      Shoe Size  Pay Rate
       String15  Int64  String1  Float64    Float64
  1    Alice     18     F         6.5       15.5
  2    Barry     24     M        12.0       18.25
  3    Charlize  45     X         8.0       22.5
  4    Doreen    72     F         7.5       24.5
  5    Edgar     64     M        10.0       33.0
  6    Francine  16     F         9.0       18.75
  
The command CSV.read requires that the 2nd argument be the output type of the read (what do we want to do with the contents of the file?). Although there are other options, DataFrame is the standard.
With the output, notice a few different things about this:
  • The Name column has the datatype String15 and the Sex column has datatype String1. These are defined in the package InlineStrings and there are a few related datatypes that can be used to store Strings in an efficient manner.
  • The columns Shoe Size and Pay Rate have spaces in their names. Although we can deal with this, we would have to refer to a name like :"Shoe Size". We can reload the file with the normalizenames keyword argument as follows:
simple = CSV.read("simple.csv", DataFrame, normalizenames = true)
and the names will have the spaces replaced with an underscore _.
Another nice option that we will see in Chapter 39 is that of reading in missing data. Often a file will use a character or string to denote that a value is missing. We can use the missingstring option to tell the CSV.read command to convert a string, like "NA", to missing.
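As a short sketch of this option (reading here from an in-memory IOBuffer, which CSV.read also accepts, rather than from a file):

```julia
using CSV, DataFrames

# a two-row CSV where Barry's score is recorded as "NA"
csv_text = "name,score\nAlice,10\nBarry,NA\n"

df = CSV.read(IOBuffer(csv_text), DataFrame, missingstring = "NA")
# df[2, :score] is now missing rather than the string "NA",
# and the score column has eltype Union{Missing, Int64}
```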
There are many other options for reading in a CSV file, and you may need them. The documentation for the CSV package is quite good, with many examples.

Section 31.5 Loading Data from an Excel File

Another common file format for data is the Microsoft Excel file (also known as an xlsx file). This is generally not a good format for data for two main reasons. First, it is a proprietary format, although a lot of different software can read and write it. Second, these files can contain spreadsheet functions, so they can contain more than data.
We will use the XLSX package; take a look at its documentation. Don’t forget to add and load it. The same dataset from Section 31.4 is available for download as simple.xlsx; make sure that this file is located in the same directory as your .ipynb file.
Because Excel files have sheets, there are generally extra steps involved in loading the data. Start with
xf = XLSX.readxlsx("simple.xlsx")
and this returns
XLSXFile("simple.xlsx") containing 1 Worksheet
sheetname size          range        
-------------------------------------------------
   Sheet1 7x5           A1:E7 
Note that this shows that there is only one sheet in the file. It is called Sheet1, and the cell range containing the data is listed as well. Since we typically want to load data into a DataFrame, we can do that with
df = DataFrame(XLSX.readtable("simple.xlsx", "Sheet1"))
and this will return:
6×5 DataFrame
  Row  Name      Age   Sex   Shoe Size  Pay Rate
       Any       Any   Any   Any        Any
  1    Alice     18    F      6.5       15.5
  2    Barry     24    M     12.0       18.25
  3    Charlize  45    X      8.0       22.5
  4    Doreen    72    F      7.5       24.5
  5    Edgar     64    M     10.0       33.0
  6    Francine  16    F      9.0       18.75
  
There are a couple of issues with this DataFrame. First, the data types are all Any, and recall that if performance is important, one main idea in Julia is that data types should be set when possible. The XLSX.readtable function has the option infer_eltypes=true that attempts to determine the types, so we reload the file with
df = DataFrame(XLSX.readtable("simple.xlsx", "Sheet1", infer_eltypes=true))
and the column types are now String, Int64, String, Float64 and Float64. Notice that the String columns did not use the more efficient String15 and String1 types. This isn’t a big deal in this small case, but it may be with a larger DataFrame. We can fix this by converting those two columns to the desired types. The following lines will do this:
df[!,:Name] = convert.(String15, df[!,:Name])
df[!,:Sex] = convert.(String1, df[!,:Sex])
and if you just type df in its own cell, you’ll see that the column datatypes now match the types above.
Another issue is the spaces in the column names, which CSV had a fix for. XLSX does not, but we can fix this problem with:
rename!(df, :"Shoe Size" => :Shoe_Size, :"Pay Rate" => :Pay_Rate)
where notice that we have used rename! with the ! to modify the existing DataFrame in place. The result of this should show that the column names have changed.