Objectives
- Accessing data from packages.
- Basics of dataframes, the essential part of doing data analysis.
- Chaining functions together with a pipe command or the chain macro.
- Loading data from a CSV or Excel file.
This chapter covers the basics of DataFrames. Then, data will be loaded from a file. In addition, a package will be used to string together operations on dataframes.
RDatasets package
The RDatasets package has many standard datasets. Ensure that the package is downloaded and then load it with using RDatasets. The available data collections are listed with
RDatasets.packages()
34×2 DataFrame 9 rows omitted
Row Package Title
String15 String
1 COUNT Functions, data and code for count data.
2 Ecdat Data sets for econometrics
3 HSAUR A Handbook of Statistical Analyses Using R (1st Edition)
4 HistData Data sets from the history of statistics and data visualization
The datasets within a given collection are listed with the datasets function. For example:
RDatasets.datasets("datasets")
50×5 DataFrame 25 rows omitted
Row Package Dataset Title Rows Columns
String15 String31 String Int64 Int64
1 datasets BOD Biochemical Oxygen Demand 6 2
2 datasets CO2 Carbon Dioxide Uptake in Grass Plants 84 5
3 datasets Formaldehyde Determination of Formaldehyde 6 2
4 datasets HairEyeColor Hair and Eye Color of Statistics Students 32 4
5 datasets InsectSprays Effectiveness of Insect Sprays 72 2
6 datasets LifeCycleSavings Intercountry Life-Cycle Savings Data 50 6
7 datasets Loblolly Growth of Loblolly pine trees 84 4
An individual dataset is loaded with the dataset function. It is not exported from the module, so it must be called with the full path. For example, we can load the classic iris dataset to view:
iris=RDatasets.dataset("datasets","iris")
150×5 DataFrame 125 rows omitted
Row SepalLength SepalWidth PetalLength PetalWidth Species
Float64 Float64 Float64 Float64 Cat…
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The columns of the iris DataFrame have different types, but it is important that each column has a fixed data type. This data type is always shown at the top of the column. A column can have the type Any, and then anything can go in it, but anytime Julia doesn't know a data type, it slows down calculations tremendously. A better way to think of a DataFrame is as a collection of Vectors (1D arrays), where each array (column) has a name.
For example, iris[2,3] will return the element in the 2nd row and 3rd column. Many of the array-accessing methods from Chapter 6 also work.
iris[!,:PetalLength] returns a 1D Array (of Float64). The ! in the first slot means to use all rows. If we want the 11th through the 20th row of the SepalWidth column, enter
iris[11:20,:SepalWidth]
10-element Vector{Float64}:
3.7
3.4
3.0
3.0
4.0
4.4
3.9
3.5
3.8
3.8
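A few more indexing patterns may be helpful; this is a sketch assuming iris has been loaded as above:

```julia
using RDatasets
iris = RDatasets.dataset("datasets", "iris")

# a single element: row 2, column 3 (PetalLength)
iris[2, 3]            # 1.4

# several rows and a subset of columns; this returns a smaller DataFrame
iris[1:3, [:SepalLength, :Species]]

# a copy of a column versus the column itself
col_copy = iris[:, :PetalLength]   # copies the data
col_vec  = iris[!, :PetalLength]   # the underlying vector, no copy
```

The distinction between `:` and `!` in the row slot matters when modifying columns: `!` gives direct access to the stored vector.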
One thing to note about displaying a DataFrame in a Jupyter notebook is that there is a limit on the number of columns and rows shown. There is a fixed number of rows (25) shown at the top and bottom of the data frame. To change this (and often I think it shows too many rows), set the DATAFRAMES_ROWS variable in the following way:
ENV["DATAFRAMES_ROWS"] = 10
The number of columns displayed can similarly be set with DATAFRAMES_COLUMNS. (I haven't had a dataset with more than the default number of columns.)
A DataFrame can also be created by specifying the columns. Consider:
data = DataFrame(
  A = 1:2:13,
  B = ["M", "F", "F", "M", "X", "F", "M"],
  C = [3.0, 2.5, pi, -2.3, 1/3, 56, 100],
  D = [(-1)^n//n for n = 1:7]
)
The columns are named A, B, C and D, and Julia infers the data type of each column. Note: this can all be put on a single line, but for readability the 4 columns are placed on separate lines. The result of this is
7×4 DataFrame
Row A B C D
Int64 String Float64 Rational…
1 1 M 3.0 -1//1
2 3 F 2.5 1//2
3 5 F 3.14159 -1//3
4 7 M -2.3 1//4
5 9 X 0.333333 -1//5
6 11 F 56.0 1//6
7 13 M 100.0 -1//7
Some useful functions for examining a DataFrame are:
- size(df): similar to an array, returns the number of rows and columns.
- names(df): the names of the columns as an array.
- describe(df): a summary of basic statistics of each column of the dataframe.
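As a quick illustration of size and names, assuming the iris DataFrame from above is loaded:

```julia
# basic information about the iris DataFrame
size(iris)       # (150, 5)
names(iris)      # ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"]
size(iris, 1)    # 150, just the number of rows
```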
A statistical summary of any DataFrame is found with the describe function. For example, describe(iris) returns
5×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Union… Any Union… Any Int64 DataType
1 SepalLength 5.84333 4.3 5.8 7.9 0 Float64
2 SepalWidth 3.05733 2.0 3.0 4.4 0 Float64
3 PetalLength 3.758 1.0 4.35 6.9 0 Float64
4 PetalWidth 1.19933 0.1 1.3 2.5 0 Float64
5 Species setosa virginica 0 CategoricalValue{String, UInt8}
The output of describe is itself a DataFrame. For numerical data, the mean, min, median and max are clear, but for categorical data only the min and max are listed, taken as the first and last of the sorted categories.
You can compute statistics like mean or std directly if you have loaded the Statistics package (which you may have to add) with the command using Statistics. Now, we can find the mean and standard deviation of a column with
mean(iris[!,:SepalLength]),std(iris[!,:SepalLength])
(5.843333333333334, 0.828066127977863)
The Statistics package also has functions to compute the variance, weighted means, median and quantiles. The Julia documentation on this package has more information.
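For example, here is a sketch of a few of these functions applied to a column of iris, assuming Statistics has been loaded:

```julia
using Statistics

x = iris[!, :SepalLength]
var(x)                            # the variance of the column
median(x)                         # 5.8
quantile(x, [0.25, 0.5, 0.75])    # the three quartiles
```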
simpsons = DataFrame(
id = 1:2:13,
name = ["Homer", "Marge", "Lisa", "Bart", "Maggie", "Apu", "Moe"],
age = [45, 42, 8, 10, 1, 38, 59],
salary = [50000, 25000, 10000, missing, missing, 45000, 3000],
favorite_food = ["pork chops", "casserole", "salad", "hamburger", missing, "saag paneer", "peanuts"]
)
Running describe(simpsons), we get the following results:
5×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Union… Any Union… Any Int64 Type
1 id 7.0 1 7.0 13 0 Int64
2 name Apu Moe 0 String
3 age 29.0 1 38.0 59 0 Int64
4 salary 26600.0 3000 25000.0 50000 2 Union{Missing, Int64}
5 favorite_food casserole salad 1 Union{Missing, String}
The nmissing column, which reports the number of missing elements in a column, is not zero for the salary and favorite_food columns. Also, the eltype (short for element type) is a Union type, meaning that elements in the column are either of type Missing (explained in Chapter 33) or of another type (either Int64 or String).
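Since the salary column contains missing values, mean(simpsons.salary) would itself return missing. A common remedy, sketched here assuming Statistics is loaded, is skipmissing:

```julia
using Statistics

# ignore the missing entries when computing the mean;
# this matches the salary mean reported by describe
mean(skipmissing(simpsons.salary))   # 26600.0
```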
Consider evaluating the nested expression
sqrt(sin(big(2.0)))
0.9535708819095105398718705441330078869495314128723191646104050462287315327622005
An alternative is to start with the number 2.0 and apply each of the functions in turn as the expression is worked through. In many computing environments, including Julia, this is possible with the pipe operator |>, which rewrites the order of operations. The above example can be written:
2.0 |> big |> sin |> sqrt
The number 2.0 is sent to the big function, then to the sin function, then to the sqrt function. This can also be thought of as reverse Polish (postfix) notation, and once understood it is quite helpful in tackling complex tasks.
Suppose we want the mean of the integers 1 through 7 with the 4th element removed. One way is mean([1,2,3,5,6,7]), but let's look at another way. We start with the range 1:7, which represents the numbers 1, 2, 3, 4, 5, 6, 7.
1:7 |>
collect |>
arr -> deleteat!(arr, 4) |>
mean
The result is 4.0. This style is often clearer in that each step is on a separate line and does a separate thing.
Chain package
The Chain package simplifies some of these steps. First, make sure that the package is added, and then load it with using Chain. Overall, a number of steps are applied to an initial piece of data. Above, we saw this with an array; in the next chapter we will see it with a DataFrame. The general structure is
@chain data begin
# functions applied to data in order
end
where each line of the block acts like a pipe |>. For example, the removal of an item from an array can be written:
@chain 1:7 begin
collect
deleteat!(4)
mean
end
The @chain macro results in the previous line always being applied to the function on the current line, so the collect line is really collect(1:7). What happens with the third line? Since deleteat! requires two arguments, @chain automatically puts the result from above into the first slot of the function. We'll see below what to do if we need something different.
Suppose we square each of the numbers [1,2,3,4,5,6,7,8,9,10], keep only the even squares and compute their standard deviation with the following @chain block:
@chain 1:10 begin
map(x -> x^2, _)
filter(x -> x % 2 == 0, _)
std
end
First, note that we passed in the UnitRange 1:10 instead of an array; this is just for simplicity. Secondly, you'll notice that there is an underscore character _ in the second and third lines. This means that the result of the previous line is put where the _ is. And since the array (or the UnitRange) needs to be in the second slot of the map function, we need to use the underscore.
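To summarize the two behaviors, here is a small sketch mixing lines with and without the underscore:

```julia
using Chain

# lines without an underscore receive the previous result as the first argument;
# lines with _ put the previous result exactly where _ appears
@chain [3, 1, 2] begin
    sort                     # sort([3, 1, 2]) gives [1, 2, 3]
    filter(x -> x > 1, _)    # filter(x -> x > 1, [1, 2, 3]) gives [2, 3]
    sum                      # sum([2, 3])
end                          # 5
```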
We will use the CSV package to load files. Recall that you will need to add and load it (with using CSV) before being able to use it.
Download the file simple.csv and make sure that it is located in the same directory as your .ipynb file. We will load it with
simple = CSV.read("simple.csv", DataFrame)
6×5 DataFrame
Row Name Age Sex Shoe Size Pay Rate
String15 Int64 String1 Float64 Float64
1 Alice 18 F 6.5 15.5
2 Barry 24 M 12.0 18.25
3 Charlize 45 X 8.0 22.5
4 Doreen 72 F 7.5 24.5
5 Edgar 64 M 10.0 33.0
6 Francine 16 F 9.0 18.75
The CSV.read function requires that the 2nd argument is the output type of the read (what do we want to do with the result of the file?). Although there are other options, DataFrame is the standard.
Notice that the Name column has the datatype String15 and the Sex column has datatype String1. These are defined in the package InlineStrings, and there are a few related datatypes that can be used to store strings in an efficient manner.
Also, the column names Shoe Size and Pay Rate have spaces in them. Although we can deal with this, we would have to refer to a name like :"Shoe Size". We can reload the file with the normalizenames keyword argument as follows:
simple = CSV.read("simple.csv", DataFrame, normalizenames = true)
This replaces the spaces in the column names with an underscore _.
There is also a missingstring option to tell the CSV.read command to convert a string, like "NA", to missing.
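A small sketch of this option, using an in-memory CSV string rather than a file so it is self-contained:

```julia
using CSV, DataFrames

# a tiny CSV where "NA" marks a missing value
csv_text = """
name,score
Alice,90
Barry,NA
"""

df = CSV.read(IOBuffer(csv_text), DataFrame, missingstring = "NA")
# the score column now has eltype Union{Missing, Int64}
```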
Data is also often stored in an Excel spreadsheet (an .xlsx file). This is generally not a good format for data for two main reasons. First, it is a proprietary format, although a lot of different software can read and write it. Secondly, these files can contain spreadsheet functions, so they can contain more than data.
We will use the XLSX package, so take a look at its documentation. Don't forget to add and load it. The same dataset from Section 31.4 is available for download as simple.xlsx; make sure that this file is located in the same directory as your .ipynb file.
xf = XLSX.readxlsx("simple.xlsx")
XLSXFile("simple.xlsx") containing 1 Worksheet
sheetname size range
-------------------------------------------------
Sheet1 7x5 A1:E7
This shows one worksheet named Sheet1, and the range of cells where the data is located is listed as well. Since we typically want to load data into a DataFrame, we can do that with
df = DataFrame(XLSX.readtable("simple.xlsx", "Sheet1"))
6×5 DataFrame
Row Name Age Sex Shoe Size Pay Rate
Any Any Any Any Any
1 Alice 18 F 6.5 15.5
2 Barry 24 M 12.0 18.25
3 Charlize 45 X 8.0 22.5
4 Doreen 72 F 7.5 24.5
5 Edgar 64 M 10.0 33.0
6 Francine 16 F 9.0 18.75
Notice that the datatype of every column is Any, and recall that if performance is important, one main idea in Julia is that data types should be set when possible. The XLSX.readtable function has the option infer_eltypes=true that attempts to determine the types, so we reload the file with
df = DataFrame(XLSX.readtable("simple.xlsx", "Sheet1", infer_eltypes=true))
The column types are now String, Int64, String, Float64 and Float64. (The last two columns likely became Float64 because some of their entries are integers while others are floats.) Notice that the String columns did not use the more efficient String15 and String1 types. This isn't a big deal in this small case, but it may be with a larger dataframe. We can fix this by converting those two columns to the proper types. The following lines will do this:
df[!,:Name] = convert.(String15, df[!,:Name])
df[!,:Sex] = convert.(String1, df[!,:Sex])
If you display df in its own cell, you'll see that the column datatypes now match the types above.
Recall that CSV could normalize column names containing spaces; XLSX does not have such an option, but we can fix this problem with:
rename!(df, :"Shoe Size" => :Shoe_Size, :"Pay Rate" => :Pay_Rate)
Note that we used rename! with the ! to replace the existing DataFrame. The result of this should show that the column names have changed.