Skip to main content

Chapter 35 Manipulating DataFrames

A typically workflow with a dataframe is to start with some data, manipulate the data, perform some calculations, perhaps plot the result. In this section, we will see how to perform calculations (as known as transforming the dataset) as well as a very standard split-transform-combine operations, which takes a dataset, splits it into multiple sets based on some characteristics, transform the smaller sets and then combine the results.

Section 35.1 Subsets of DataFrames

The typical subsets of a DataFrame are either limiting the columns or the rows in some sense. Let’s return to the very simple DataFrame from the previous chapter.
data = DataFrame(
  A = 1:2:13,
  B = ["M", "F", "F", "M", "X", "F", "M"],
  C = [3.0, 2.5, pi, -2.3, 1/3, 56, 100],
  D = [(-1)^n//n for n=1:7]
)

Subsection 35.1.1 Subsets of Columns

We can select only the columns A and D in the following way:
 data[:,[:A,:D]] 
which returns
7×2 DataFrame
  Row  A      D
       Int64  Rational…  
  1    1      -1//1
  2    3      1//2
  3    5      -1//3
  4    7      1//4
  5    9      -1//5
  6    11     1//6
  7    13     -1//7
Alternatively, we can (and generally will) use the select command, which is in the DataFrames package. The follow gives the same results as above.
select(data, :A, :D)
Lastly, we are going to typically put all of these command in a @chain block, like the following:
@chain data begin
  select(:A, :D)
end
which also produces the same result. For this example, this is overkill because it is just a single command on the variable data and using @chain generally only pays off when you apply many commands together.
The Not and Between functions are helpful as well for select columns. For example,
@chain data begin
  select(Not(:B))
end
returns the columns A, C and D. The command
@chain data begin
  select(Between(:B,:D))
end
returns the last 3 columns of the DataFrame. There are other ways of subsetting by columns including passing in a Regular Expression (see Chapter 26√), which can also be helpful.

Section 35.2 Filtering (or Subsetting) Rows

A common method with DataFrames is to filter the rows by some condition. If you want to return the first few rows, the last few rows or a sequence of them, we can use the same technique from Chapter 6. For example, if we only want the first three rows, use data[1:3,:] and if we only want rows 5 to the end, use data[5:end,:]. If we want the odd rows, then data[1:2:end,:].
The first and last commands from the DataFrames package is also helpful. The first 3 rows can be found with first(data, 3) and similarly, the last 4 rows can be returned with last(data, 4). Use of these is nice within a @chain block in that the previous technique is a bit more difficult.
There is a filter
 1 
The syntax of the filter command is similar to that for vectors and not the same as the rest of the other functions in Dataframes, so subset is preferred. With effort, however, filter can be used with DataFrames.
command; however, we will use the subset function instead. We’ll take a look at a number of examples of this within a @chian block. To begin with, the following returns all rows where values in the A column is less than 10.
@chain data begin
  subset(:A => a-> a .< 10)
end
and this returns
5×4 DataFrame
  Row  A      B       C         D
       Int64  String  Float64   Rational…
  1    1      M       3.0       -1//1
  2    3      F       2.5       1//2
  3    5      F       3.14159   -1//3
  4    7      M       -2.3      1//4
  5    9      X       0.333333  -1//5
There is a lot packed into this command. First of all, notice that we want to subset on the A column so we use :A => the rest of this is a function and we use an anonymous function that returns a boolean. Actually, it is a vector of booleans. Recall that select a column of a DataFrame is a vector and select expects a boolean vector to do the filtering. This is which there is a .< instead of a <. Remove the . and you’ll see that there is an error.
For our next example, we select all rows where column B equals "F". This can be accomplished with
@chain data begin
  subset(:B => b-> b .== "F")
end
which returns
5×4 DataFrame
  Row  A      B       C         D
       Int64  String  Float64   Rational…
  1    3      F       2.5       1//2
  2    5      F       3.14159   -1//3
  3    11     F       56.0      1//6
We can also subset by combinations of columns. For the next example, we’ll find where the product of columns C and D are greater than 2. We do this with
@chain data begin
  subset([:C, :D] => (c, d) -> c .* d .> 2)
end
which returns
5×4 DataFrame
  Row  A      B       C         D
       Int64  String  Float64   Rational…
  1    11     F       56.0      1//6
The broadcasting notation can be a bit tricky although nearly always one can write a . in front of operators or append a . to the end of function names. See Section 6.5 for more details. Recall that the macro @. will take an expression and vectorize (or broadcast it). Entering
@chain data begin
  subset([:C, :D] => (c, d) -> @. c * d > 2)
end
Returns the same result as above. Also, in the DataFrames package, there is a function called ByRow that will perform a row-by-row version of a function which is often needed. The following
@chain data begin
  subset([:C, :D] => ByRow((c, d) -> c * d > 2))
end
also returns the same DataFrame.

Check Your Understanding 35.1.

Using the iris dataset, find all setosa irises (look at the Species column) with SepalWidth larger than 4. How many are there in the dataset?

Section 35.3 Transforming DataFrames

Another important technique needed with DataFrames is that of a transformation. In general, this means that a new column is created that is a function of one or more columns. There are two commands using DataFrames that transform dataframes, select and transform. The first is used to create a new DataFrame with generally only one or more transformed columns. The transform method includes the original dataframe with the extra columns defined as the transformation. We’ll see the difference here.
If we generate a new DataFrame which is just the square of the A column as
@chain data begin
  select(:A => a -> a.^2)
end
and this returns
7×1 DataFrame
  Row  A_function
       Int64
  1    1
  2    9
  3    25
  4    49
  5    81
  6    121
  7    169
Notice that the only column in the resulting DataFrame is the new column and it has the column name formed by appending _function to the column name. We can give the column a better name setting this with the fat arrow => such as
@chain data begin
  select(:A => (a -> a.^2) => :Asq)
end
and now the column has the name Asq. Note that the parentheses around the anonymous function is needed due to order of operations. Remove them to see the effect.
In contrast if we use the transform method, then the original matrix is preserved and the new columns is added at the end.
@chain data begin
  transform(:A => (a -> a.^2) => :Asq)
end
results in
7×5 DataFrame
  Row  A      B       C         D          Asq
       Int64  String  Float64   Rational…  Int64  
  1    1      M       3.0       -1//1      1
  2    3      F       2.5        1//2       9
  3    5      F       3.14159   -1//3      25
  4    7      M       -2.3       1//4       49
  5    9      X       0.333333  -1//5      81
  6    11     F       56.0       1//6       121
  7    13     M       100.0     -1//7      169
Both select and transform can use multiple columns. The following example, takes the product between columns C and D.
@chain data begin
  select([:C, :D] => ((c,d) -> @. c*d) => :prod)
end
where the macro @. has been used instead of using .*. This returns
7×1 DataFrame
  Row  prod
       Float64
  1    -3.0
  2    1.25
  3    -1.0472
  4    -0.575
  5    -0.0666667
  6    9.33333
  7    -14.2857
Note: there are also commands select and transform that perform the operations within the given dataframe. This can be destructive, but can save a lot of computational time for large DataFrames. It is recommended to stay with the non mutating (without the !) versions unless you need to save time/memory.

Check Your Understanding 35.2.

Using the iris dataframe produce a new column called area which is the area of a petal using the PetalLength and PetalWidth variables and the area of an ellipse.

Section 35.4 Sorting DataFrames

The DataFrames package also provides easy ways to sort. If we want to sort by the C column, we can do
sort(data, :C)
or staying within a @chain block
@chain data begin
  sort(:C)
end
and the results are
7×4 DataFrame
  Row  A      B       C         D
       Int64  String  Float64   Rational…
  1    7      M       -2.3      1//4
  2    9      X       0.333333  -1//5
  3    3      F       2.5       1//2
  4    1      M       3.0       -1//1
  5    5      F       3.14159   -1//3
  6    11     F       56.0      1//6
  7    13     M       100.0     -1//7
where the datatype determines the way to sort. Note that the rows are preserved. This is different than when we saw in Chapter 6 which sorted every column in either increasing or decreasing order. In the case of sorting by column C, it is numerically in increasing order. If we want the reverse:
@chain data begin
  sort(:C, rev = true)
end
Julia will detect the datatype automatically. If we sort by column B, via
@chain data begin
  sort(:B)
end
the result is
7×4 DataFrame
  Row  A      B       C         D
       Int64  String  Float64   Rational…
  1    3      F       2.5        1//2
  2    5      F       3.14159   -1//3
  3    11     F       56.0       1//6
  4    1      M       3.0       -1//1
  5    7      M       -2.3       1//4
  6    13     M       100.0     -1//7
  7    9      X       0.333333  -1//5
which sorts lexiographically automatically. Notice that the other items are not sorted or kept in the same order as original. If we want to sort by another column, say :D, we can do
@chain data begin
  sort([:B,:D])
end
where the column names are placed inside an array. The results are
7×4 DataFrame
Row  A      B       C        D
     Int64  String  Float64  Rational…
1     5     F       3.14159   -1//3
2    11     F       56.0       1//6
3     3     F       2.5        1//2
4     1     M       3.0       -1//1
5    13     M       100.0     -1//7
6     7     M       -2.3       1//4
7     9     X       0.333333  -1//5
And notice that within rows of constant B (like when it is "F"), the rows are then sorted by column D.

Section 35.5 Joining DataFrames

Often, data is split between different tables (or files) and they need to be joined before doing analysis. Let’s revisit the dataset from Chapter 31:
simpsons = DataFrame(
  name=["Homer","Marge","Lisa","Bart","Maggie","Apu","Moe", "Milhouse", "Patty"],
  age =[45,42,8,10,1,38,59, 10, 46],
  current_school_grade = [missing, missing, 2, 4, missing, missing, missing, 4, missing],
  favorite_food = ["pork chops","casserole","salad","hamburger",missing,"saag paneer","peanuts", missing, "Lady Laramie 100s"]
)
and a second dataset
jobs = DataFrame(
  name = ["Homer","Marge","Apu","Moe", "Wiggam"],
  job = ["nuclear technician", "housewife", "store owner", "bartender", "police chief"],
  salary = [50_000, 25_000, 60_000, 15_000, 75_000]
)
and we want to join these two together. In order to join two datasets, one of the columns in each dataframe needs to match. In this case, the column that makes most sense is the name column, which is common to both. For this example, we’ll use the innerjoin command and the keyword attributed on must be used with the format col1 => col2 , where col1 is in the first DataFrame and col2 in the second one. The column that the sets are joined on are called the key of the DataFrame. The innerjoin command returns all rows in which the key is common to both datasets. The inner part is like the intersection of two sets. If we perform the following:
innerjoin(data, simpsons, on = :name => :name)
and the result will be
5×6 DataFrame
Row  name    age     current_school_grade  favorite_food     job                  salary
     String  Int64   Int64?                String?           String               Int64
1    Homer   45      missing               pork chops        nuclear technician   50000
2    Marge   42      missing               casserole         housewife            25000
3    Apu     38      missing               saag              paneer  store owner  60000
4    Moe     59      missing               peanuts           bartender            15000
5    Patty   46      missing               Lady Laramie 100s missing	             missing
Note: in this case, because the column name on the the two dataframes are the same, you can just do on = :name.
You may be surprised that there are only 5 rows in this DataFrame, however, recall that innerjoin merges two DataFrames, by including only where the name column is in both of the two DataFrames. There are other ways to join Dataframes, and a summary is:
innerjoin
the output contains rows for values of the key that exist in all passed data frames.
leftjoin
the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
rightjoin
the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
outerjoin
the output contains rows for values of the key that exist in any of the passed data frames.
semijoin
Like an inner join, but output is restricted to columns from the first (left) argument.
antijoin
The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with semijoin, output is restricted to columns from the first (left) argument.
crossjoin
The output is the cartesian product of rows from all passed data frames.
In the exmaple above, suppose we want a DataFrame with all keys (names). In this case, we will use the outerjoin command. If we evaluate:
outerjoin(simpsons, jobs, on = :name)
then the result is
9×6 DataFrame
Row  name      age      current_school_grade  favorite_food      job                 salary
     String    Int64?   Int64?                String?            String?             Int64?
1    Homer     45       missing               pork chops         nuclear technician  50000
2    Marge     42       missing               casserole          housewife           25000
3    Apu       38       missing               saag paneer        store owner         60000
4    Moe       59       missing               peanuts            bartender           15000
5    Patty     46       missing               Lady Laramie 100s	missing             missing
6    Lisa      8        2                     salad              missing             missing
7    Bart      10       4                     hamburger          missing             missing
8    Maggie    1        missing               missing            missing             missing
9    Milhouse  10       4                     missing            missing             missing
10   Wiggam    missing  missing               missing            police chief        75000

Section 35.6 Summarizing Data

Generally one wants to summarize a large dataset to a few numbers to get a sense or for comparison purposes. The describe does this, but in a fixed way. A more general function to do this is called the combine function. Consider the following:
combine(data, :C => mean, :D => mean, :C=>std, :C => length)
which returns
1×4 DataFrame
  Row  C_mean    D_mean      C_std    C_length
       Float64   Rational…   Float64  Int64
1      23.2393   -319//2940  39.5518  7
And this returns a new dataframe with 1 row and 4 columns (mean of C, mean of D, standard deviation of C and the number of rows in C).
We will also use combine after splitting a dataset for comparison purposes. We’ll do this below in Section 36.1

Section 35.7 Multiple Steps on a DataFrame

All of the previous examples in this chapter show a single step on a DataFrame. However, typically this is not how operations on DataFrames are not done. There are multiple steps. The next section will show this for a particular sequence of steps, but before moving onto that, let’s look at a series of steps. Consider the following steps:
  1. Merge/join the simpsons and jobs DataFrames as shown above.
  2. Make a new column that is the salary column minus 400 times the age. Call it happiness.
  3. Drop any rows in which the happiness is missing.
  4. Filter out any rows with the happiness column is negative.
  5. Include only the columns name, age, favorite_food and happiness
  6. Sort the result by happiness with largest values on top.
We’re going to set this up as a @chain block with a separate line for each step. First, setup the block and peform the merge using innerjoin.
@chain simpsons begin
  innerjoin(jobs, on = :name)
end
And this gives the merged dataset as above. Next, we’ll use transform to create the happiness column.
@chain data begin
  innerjoin(simpsons, on = :A => :id)
  transform([:age, :salary] =  > ((a, s) -> @. s - 400*a) => :happiness)
end
This will give a new column on the DataFrame with the calculation above. Notice that there is a missing value because the salary is missing on that row. For the next step, we add a dropmissing line like:
@chain data begin
  innerjoin(simpsons, on = :A => :id)
  transform([:C, :D, :salary] => ((c, d, s) -> @. s + c*d) => :happiness)
  dropmissing(:happiness)
end
and the result of this is the following:
4×7 DataFrame
Row  name    age    current_school_grade  favorite_food  job                 salary  happiness
     String  Int64  Int64?                String?        String              Int64?  Int64
1    Homer   45     missing               pork chops     nuclear technician  50000   32000
2    Marge   42     missing               casserole      housewife           25000   8200
3    Apu     38     missing               saag paneer    store owner         60000   44800
4    Moe     59     missing               peanuts        bartender           15000   -8600
Next, we want to only include the rows with positive happiness and we can add a subset command. Including this in the @chain block.
@chain data begin
  innerjoin(simpsons, on = :A => :id)
  transform([:C, :D, :salary] => ((c, d, s) -> @. s + c*d) => :happiness)
  dropmissing(:happiness)
  subset(:happiness => h -> h .> 0)
end
and this drops the last row from the output above. Note that we have use the broadcast on the greater than (.>), however the @. could have also been used.
For the next step, we are only considering a few of the columns, and we can do this with the select command. Adding this to the @chain block as
@chain data begin
  innerjoin(simpsons, on = :A => :id)
  transform([:C, :D, :salary] => ((c, d, s) -> @. s + c*d) => :happiness)
  dropmissing(:happiness)
  subset(:happiness => h -> h .> 0)
  select(:name, :age, :favorite_food, :happiness)
end
which results in
4×7 DataFrame
Row  name    age    favorite_food  happiness
     String  Int64  String?        Int64
1    Homer   45     pork chops     32000
2    Marge   42     casserole      8200
3    Apu     38     saag paneer    44800
and the last step sorts by happiness with rev = true to sort largest to smallest
@chain data begin
  innerjoin(simpsons, on = :A => :id)
  transform([:C, :D, :salary] => ((c, d, s) -> @. s + c*d) => :happiness)
  dropmissing(:happiness)
  subset(:happiness => h -> h .> 0)
  select(:name, :age, :favorite_food, :happiness)
  sort(:happines, rev = true)
end
and this result is
3×7 DataFrame
Row  name    age    favorite_food  happiness
     String  Int64  String?        Int64
1    Apu     38     saag paneer    44800
2    Homer   45     pork chops     32000
3    Marge   42     casserole      8200

Section 35.8 Summary of DataFrame Transformations

The examples in this chapter are relatively simple. Chapter 38 and Chapter 39 show more realistic data sets with a large census dataset and a olympic event dataset respectively, however use all of the techniques listed above.