Chapter 36 Split, Combine, Apply a DataFrame
Section 36.1 Split-Apply-Combine
A common situation with data analysis is to have a dataset and you want to compare means or standard deviations within a dataset. What needs to often happen is that you first split a dataset, do some analysis on each group then summarize. This is know as split-apply-combine. We will demonstrate this with an example. Let’s return to the
iris dataset that we loaded in the previous chapter. Recall that the RDatasets package must be loaded and then load the dataset with
iris = RDatasets.dataset("datasets","iris")
First, we will split the dataset by the
Species column using the groupby function in the @chain block as
@chain iris begin groupby(:Species) end
and the result shows two DataFrames (actually
GroupedDataFrames), the first and last. At the top of the output, it says GroupedDataFrame with 3 groups based on key: Species Even though we have ``split’’ the dataset, it is all stored in a single variable of type GroupedDataFrame. It shows part of the first group and part of the last group. This is mainly because of space for the output.
Next, we may want to compare the variables between these three groups. For example, to compare the mean of the
SepalLength variable, we use the combine function like:
@chain iris begin groupby(:Species) combine(:SepalLength => mean) end
and the result is
3×2 DataFrame
Row Species SepalLength_mean
Cat… Float64
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
If we want the number of rows in each set:
@chain iris begin groupby(:Species) combine(nrow) end
which shows that there are 50 rows in each subset. Notice that this is different than the previous example (mean of the SepalLength) in that no column is needed and the function
nrow is applied to the entire Grouped Data Frame.
We can also use
combine to produce many summarizing values. If we want the number of rows, the mean of the SepalLength, the standard deviation of the SepalWidth and the maximum of the PetalWidth, we can do this with one command as in
@chain iris begin groupby(:Species) combine(nrow, :SepalLength => mean, :SepalWidth => std, :PetalWidth => maximum) end
resulting in
3×5 DataFrame,
Row Species nrow SepalLength_mean SepalWidth_std PetalWidth_maximum
Cat… Int64 Float64 Float64 Float64
1 setosa 50 5.006 0.379064 0.6
2 versicolor 50 5.936 0.313798 1.8
3 virginica 50 6.588 0.322497 2.5
