Section 35.2 Filtering (or Subsetting) Rows
A common method with DataFrames is to filter the rows by some condition. If you want to return the first few rows, the last few rows or a sequence of them, we can use the same technique from
Chapter 6. For example, if we only want the first three rows, use
data[1:3,:] and if we only want rows 5 to the end, use
data[5:end,:]. If we want the odd rows, then
data[1:2:end,:].
The
first and
last commands from the
DataFrames package is also helpful. The first 3 rows can be found with
first(data, 3) and similarly, the last 4 rows can be returned with
last(data, 4). Use of these is nice within a
@chain block in that the previous technique is a bit more difficult.
There is a
filter command; however, we will use the
subset function instead. We’ll take a look at a number of examples of this within a
@chian block. To begin with, the following returns all rows where values in the
A column is less than 10.
@chain data begin
subset(:A => a-> a .< 10)
end
5×4 DataFrame
Row A B C D
Int64 String Float64 Rational…
1 1 M 3.0 -1//1
2 3 F 2.5 1//2
3 5 F 3.14159 -1//3
4 7 M -2.3 1//4
5 9 X 0.333333 -1//5
There is a lot packed into this command. First of all, notice that we want to subset on the
A column so we use
:A => the rest of this is a function and we use an anonymous function that returns a boolean. Actually, it is a vector of booleans. Recall that select a column of a DataFrame is a vector and select expects a boolean vector to do the filtering. This is which there is a
.< instead of a
<. Remove the
. and you’ll see that there is an error.
For our next example, we select all rows where column
B equals "F". This can be accomplished with
@chain data begin
subset(:B => b-> b .== "F")
end
5×4 DataFrame
Row A B C D
Int64 String Float64 Rational…
1 3 F 2.5 1//2
2 5 F 3.14159 -1//3
3 11 F 56.0 1//6
We can also subset by combinations of columns. For the next example, we’ll find where the product of columns
C and
D are greater than 2. We do this with
@chain data begin
subset([:C, :D] => (c, d) -> c .* d .> 2)
end
5×4 DataFrame
Row A B C D
Int64 String Float64 Rational…
1 11 F 56.0 1//6
The broadcasting notation can be a bit tricky although nearly always one can write a
. in front of operators or append a
. to the end of function names. See
Section 6.5 for more details. Recall that the macro
@. will take an expression and vectorize (or broadcast it). Entering
@chain data begin
subset([:C, :D] => (c, d) -> @. c * d > 2)
end
Returns the same result as above. Also, in the
DataFrames package, there is a function called
ByRow that will perform a row-by-row version of a function which is often needed. The following
@chain data begin
subset([:C, :D] => ByRow((c, d) -> c * d > 2))
end
also returns the same DataFrame.
Check Your Understanding 35.1.
Using the
iris dataset, find all
setosa irises (look at the Species column) with
SepalWidth larger than 4. How many are there in the dataset?
Section 35.3 Transforming DataFrames
Another important technique needed with DataFrames is that of a transformation. In general, this means that a new column is created that is a function of one or more columns. There are two commands using DataFrames that transform dataframes,
select and
transform. The first is used to create a new DataFrame with generally only one or more transformed columns. The
transform method includes the original dataframe with the extra columns defined as the transformation. We’ll see the difference here.
If we generate a new DataFrame which is just the square of the
A column as
@chain data begin
select(:A => a -> a.^2)
end
7×1 DataFrame
Row A_function
Int64
1 1
2 9
3 25
4 49
5 81
6 121
7 169
Notice that the only column in the resulting DataFrame is the new column and it has the column name formed by appending
_function to the column name. We can give the column a better name setting this with the fat arrow
=> such as
@chain data begin
select(:A => (a -> a.^2) => :Asq)
end
and now the column has the name
Asq. Note that the parentheses around the anonymous function is needed due to order of operations. Remove them to see the effect.
In contrast if we use the
transform method, then the original matrix is preserved and the new columns is added at the end.
@chain data begin
transform(:A => (a -> a.^2) => :Asq)
end
7×5 DataFrame
Row A B C D Asq
Int64 String Float64 Rational… Int64
1 1 M 3.0 -1//1 1
2 3 F 2.5 1//2 9
3 5 F 3.14159 -1//3 25
4 7 M -2.3 1//4 49
5 9 X 0.333333 -1//5 81
6 11 F 56.0 1//6 121
7 13 M 100.0 -1//7 169
Both
select and
transform can use multiple columns. The following example, takes the product between columns
C and
D.
@chain data begin
select([:C, :D] => ((c,d) -> @. c*d) => :prod)
end
where the macro
@. has been used instead of using
.*. This returns
7×1 DataFrame
Row prod
Float64
1 -3.0
2 1.25
3 -1.0472
4 -0.575
5 -0.0666667
6 9.33333
7 -14.2857
Note: there are also commands
select and
transform that perform the operations within the given dataframe. This can be destructive, but can save a lot of computational time for large DataFrames. It is recommended to stay with the non mutating (without the !) versions unless you need to save time/memory.
Check Your Understanding 35.2.
Using the
iris dataframe produce a new column called
area which is the area of a petal using the
PetalLength and
PetalWidth variables and the area of an ellipse.
Section 35.4 Sorting DataFrames
The
DataFrames package also provides easy ways to sort. If we want to sort by the C column, we can do
or staying within a
@chain block
@chain data begin
sort(:C)
end
7×4 DataFrame
Row A B C D
Int64 String Float64 Rational…
1 7 M -2.3 1//4
2 9 X 0.333333 -1//5
3 3 F 2.5 1//2
4 1 M 3.0 -1//1
5 5 F 3.14159 -1//3
6 11 F 56.0 1//6
7 13 M 100.0 -1//7
where the datatype determines the way to sort. Note that the rows are preserved. This is different than when we saw in
Chapter 6 which sorted every column in either increasing or decreasing order. In the case of sorting by column C, it is numerically in increasing order. If we want the reverse:
@chain data begin
sort(:C, rev = true)
end
Julia will detect the datatype automatically. If we sort by column B, via
@chain data begin
sort(:B)
end
7×4 DataFrame
Row A B C D
Int64 String Float64 Rational…
1 3 F 2.5 1//2
2 5 F 3.14159 -1//3
3 11 F 56.0 1//6
4 1 M 3.0 -1//1
5 7 M -2.3 1//4
6 13 M 100.0 -1//7
7 9 X 0.333333 -1//5
which sorts lexiographically automatically. Notice that the other items are not sorted or kept in the same order as original. If we want to sort by another column, say
:D, we can do
@chain data begin
sort([:B,:D])
end
where the column names are placed inside an array. The results are
7×4 DataFrame
Row A B C D
Int64 String Float64 Rational…
1 5 F 3.14159 -1//3
2 11 F 56.0 1//6
3 3 F 2.5 1//2
4 1 M 3.0 -1//1
5 13 M 100.0 -1//7
6 7 M -2.3 1//4
7 9 X 0.333333 -1//5
And notice that within rows of constant B (like when it is "F"), the rows are then sorted by column D.
Section 35.5 Joining DataFrames
Often, data is split between different tables (or files) and they need to be joined before doing analysis. Let’s revisit the dataset from
Chapter 31:
simpsons = DataFrame(
name=["Homer","Marge","Lisa","Bart","Maggie","Apu","Moe", "Milhouse", "Patty"],
age =[45,42,8,10,1,38,59, 10, 46],
current_school_grade = [missing, missing, 2, 4, missing, missing, missing, 4, missing],
favorite_food = ["pork chops","casserole","salad","hamburger",missing,"saag paneer","peanuts", missing, "Lady Laramie 100s"]
)
jobs = DataFrame(
name = ["Homer","Marge","Apu","Moe", "Wiggam"],
job = ["nuclear technician", "housewife", "store owner", "bartender", "police chief"],
salary = [50_000, 25_000, 60_000, 15_000, 75_000]
)
and we want to join these two together. In order to join two datasets, one of the columns in each dataframe needs to match. In this case, the column that makes most sense is the
name column, which is common to both. For this example, we’ll use the
innerjoin command and the keyword attributed
on must be used with the format
col1 => col2 , where
col1 is in the first DataFrame and
col2 in the second one. The column that the sets are joined on are called the
key of the DataFrame. The
innerjoin command returns all rows in which the key is common to both datasets. The inner part is like the intersection of two sets. If we perform the following:
innerjoin(data, simpsons, on = :name => :name)
5×6 DataFrame
Row name age current_school_grade favorite_food job salary
String Int64 Int64? String? String Int64
1 Homer 45 missing pork chops nuclear technician 50000
2 Marge 42 missing casserole housewife 25000
3 Apu 38 missing saag paneer store owner 60000
4 Moe 59 missing peanuts bartender 15000
5 Patty 46 missing Lady Laramie 100s missing missing
Note: in this case, because the column name on the the two dataframes are the same, you can just do
on = :name.
You may be surprised that there are only 5 rows in this DataFrame, however, recall that
innerjoin merges two DataFrames, by including only where the
name column is in both of the two DataFrames. There are other ways to join Dataframes, and a summary is:
-
innerjoin
the output contains rows for values of the key that exist in all passed data frames.
-
leftjoin
the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
-
rightjoin
the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
-
outerjoin
the output contains rows for values of the key that exist in any of the passed data frames.
-
semijoin
Like an inner join, but output is restricted to columns from the first (left) argument.
-
antijoin
The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with semijoin, output is restricted to columns from the first (left) argument.
-
crossjoin
The output is the cartesian product of rows from all passed data frames.
In the exmaple above, suppose we want a DataFrame with all keys (names). In this case, we will use the
outerjoin command. If we evaluate:
outerjoin(simpsons, jobs, on = :name)
9×6 DataFrame
Row name age current_school_grade favorite_food job salary
String Int64? Int64? String? String? Int64?
1 Homer 45 missing pork chops nuclear technician 50000
2 Marge 42 missing casserole housewife 25000
3 Apu 38 missing saag paneer store owner 60000
4 Moe 59 missing peanuts bartender 15000
5 Patty 46 missing Lady Laramie 100s missing missing
6 Lisa 8 2 salad missing missing
7 Bart 10 4 hamburger missing missing
8 Maggie 1 missing missing missing missing
9 Milhouse 10 4 missing missing missing
10 Wiggam missing missing missing police chief 75000
Section 35.6 Summarizing Data
Generally one wants to summarize a large dataset to a few numbers to get a sense or for comparison purposes. The
describe does this, but in a fixed way. A more general function to do this is called the
combine function. Consider the following:
combine(data, :C => mean, :D => mean, :C=>std, :C => length)
1×4 DataFrame
Row C_mean D_mean C_std C_length
Float64 Rational… Float64 Int64
1 23.2393 -319//2940 39.5518 7
And this returns a new dataframe with 1 row and 4 columns (mean of C, mean of D, standard deviation of C and the number of rows in C).
We will also use combine after splitting a dataset for comparison purposes. We’ll do this below in
Section 36.1
Section 35.7 Multiple Steps on a DataFrame
All of the previous examples in this chapter show a single step on a DataFrame. However, typically this is not how operations on DataFrames are not done. There are multiple steps. The next section will show this for a particular sequence of steps, but before moving onto that, let’s look at a series of steps. Consider the following steps:
-
Merge/join the
simpsons and
jobs DataFrames as shown above.
-
Make a new column that is the salary column minus 400 times the age. Call it
happiness.
-
Drop any rows in which the
happiness is missing.
-
Filter out any rows with the
happiness column is negative.
-
Include only the columns
name,
age,
favorite_food and
happiness
-
Sort the result by happiness with largest values on top.
We’re going to set this up as a
@chain block with a separate line for each step. First, setup the block and peform the merge using
innerjoin.
@chain simpsons begin
innerjoin(jobs, on = :name)
end
And this gives the merged dataset as above. Next, we’ll use
transform to create the
happiness column.
@chain data begin
innerjoin(simpsons, on = :A => :id)
transform([:age, :salary] = > ((a, s) -> @. s - 400*a) => :happiness)
end
This will give a new column on the DataFrame with the calculation above. Notice that there is a
missing value because the
salary is missing on that row. For the next step, we add a
dropmissing line like:
@chain data begin
innerjoin(simpsons, on = :A => :id)
transform([:C, :D, :salary] => ((c, d, s) -> @. s + c*d) => :happiness)
dropmissing(:happiness)
end
and the result of this is the following:
4×7 DataFrame
Row name age current_school_grade favorite_food job salary happiness
String Int64 Int64? String? String Int64? Int64
1 Homer 45 missing pork chops nuclear technician 50000 32000
2 Marge 42 missing casserole housewife 25000 8200
3 Apu 38 missing saag paneer store owner 60000 44800
4 Moe 59 missing peanuts bartender 15000 -8600
Next, we want to only include the rows with positive
happiness and we can add a
subset command. Including this in the
@chain block.
@chain data begin
innerjoin(simpsons, on = :A => :id)
transform([:C, :D, :salary] => ((c, d, s) -> @. s + c*d) => :happiness)
dropmissing(:happiness)
subset(:happiness => h -> h .> 0)
end
and this drops the last row from the output above. Note that we have use the broadcast on the greater than (
.>), however the
@. could have also been used.
For the next step, we are only considering a few of the columns, and we can do this with the
select command. Adding this to the
@chain block as
@chain data begin
innerjoin(simpsons, on = :A => :id)
transform([:C, :D, :salary] => ((c, d, s) -> @. s + c*d) => :happiness)
dropmissing(:happiness)
subset(:happiness => h -> h .> 0)
select(:name, :age, :favorite_food, :happiness)
end
4×7 DataFrame
Row name age favorite_food happiness
String Int64 String? Int64
1 Homer 45 pork chops 32000
2 Marge 42 casserole 8200
3 Apu 38 saag paneer 44800
and the last step sorts by
happiness with
rev = true to sort largest to smallest
@chain data begin
innerjoin(simpsons, on = :A => :id)
transform([:C, :D, :salary] => ((c, d, s) -> @. s + c*d) => :happiness)
dropmissing(:happiness)
subset(:happiness => h -> h .> 0)
select(:name, :age, :favorite_food, :happiness)
sort(:happines, rev = true)
end
3×7 DataFrame
Row name age favorite_food happiness
String Int64 String? Int64
1 Apu 38 saag paneer 44800
2 Homer 45 pork chops 32000
3 Marge 42 casserole 8200