Skip to main content

Chapter 36 Split, Combine, Apply a DataFrame

Section 36.1 Split-Apply-Combine

A common situation with data analysis is to have a dataset and you want to compare means or standard deviations within a dataset. What needs to often happen is that you first split a dataset, do some analysis on each group then summarize. This is know as split-apply-combine. We will demonstrate this with an example. Let’s return to the iris dataset that we loaded in the previous chapter. Recall that the RDatasets package must be loaded and then load the dataset with
iris = RDatasets.dataset("datasets","iris")
First, we will split the dataset by the Species column using the groupby function in the @chain block as
@chain iris begin
  groupby(:Species)
end
and the result shows two DataFrames (actually GroupedDataFrames), the first and last. At the top of the output, it says GroupedDataFrame with 3 groups based on key: Species Even though we have ``split’’ the dataset, it is all stored in a single variable of type GroupedDataFrame. It shows part of the first group and part of the last group. This is mainly because of space for the output.
Next, we may want to compare the variables between these three groups. For example, to compare the mean of the SepalLength variable, we use the combine function like:
@chain iris begin
  groupby(:Species)
  combine(:SepalLength => mean)
end
and the result is
3×2 DataFrame
Row  Species      SepalLength_mean
     Cat…         Float64
1    setosa       5.006
2    versicolor   5.936
3    virginica    6.588
If we want the number of rows in each set:
@chain iris begin
  groupby(:Species)
  combine(nrow)
end
which shows that there are 50 rows in each subset. Notice that this is different than the previous example (mean of the SepalLength) in that no column is needed and the function nrow is applied to the entire Grouped Data Frame.
We can also use combine to produce many summarizing values. If we want the number of rows, the mean of the SepalLength, the standard deviation of the SepalWidth and the maximum of the PetalWidth, we can do this with one command as in
@chain iris begin
  groupby(:Species)
  combine(nrow, :SepalLength => mean, :SepalWidth => std, :PetalWidth => maximum)
end
resulting in
3×5 DataFrame,
Row   Species     nrow     SepalLength_mean  SepalWidth_std  PetalWidth_maximum
      Cat…        Int64    Float64           Float64         Float64
1     setosa      50       5.006             0.379064        0.6
2     versicolor  50       5.936             0.313798        1.8
3     virginica   50       6.588             0.322497        2.5