
Chapter 32 Plotting Data using Makie

Many of the examples above have involved plotting functions. This chapter covers plotting data, primarily using Makie, and begins with an overview of how to think about data. In particular, recall that data is generally either discrete or continuous. Discrete data has a finite or countable number of possible values, such as categories or numerical data that can be counted. Continuous data is measured with real numbers (inches of rain, heights of people, etc.). It’s important to understand the difference in order to choose and read plots.
Recall that we covered plotting in Makie in Chapter 14. It’s important to have a grasp of the plotting basics, including the backends. We will mainly use CairoMakie here, but that is not required. Recall that to set the CairoMakie backend, we run:
using CairoMakie
CairoMakie.activate!()
Makie.inline!(true)

Section 32.1 Plotting Continuous Data

We first start with plotting continuous data.

Subsection 32.1.1 Histograms

If one has univariate data (recall that this means that every observation has only one value associated with it) that is continuous, the standard way to understand the distribution of the data is with a histogram.
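As a minimal sketch of a histogram, the following uses Makie's hist! function on a synthetic sample. The data (500 normally distributed "heights"), the axis labels, the seed, and the choice of 20 bins are all assumptions made for illustration and do not come from a dataset in this text.

```julia
using CairoMakie, Random

Random.seed!(1)                      # make the synthetic sample reproducible
heights = 170 .+ 8 .* randn(500)     # made-up continuous data (assumption, not real measurements)

fig = Figure()
ax = Axis(fig[1, 1], xlabel = "height (cm)", ylabel = "number of observations",
  title = "Histogram of a synthetic sample")
# bins sets the number of bins; strokewidth draws a thin border around each bar
hist!(ax, heights, bins = 20, strokewidth = 0.5)
fig
```

Changing the bins value is worth experimenting with: too few bins hide the shape of the distribution and too many make the plot noisy.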

Subsection 32.1.2 Scatter Plots

We have already seen examples of plotting continuous data with the CO₂ data above. This data is the mean CO₂ level over each year and although the year seems like a discrete variable, time is actually continuous. Because of this, a scatter plot is a good way to present this data.
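A scatter plot of yearly values can be sketched as follows. The numbers here are placeholder values following a made-up linear trend, not the CO₂ measurements discussed above, and the variable names are hypothetical.

```julia
using CairoMakie

years = 2000:2020
# placeholder values with a made-up linear trend (assumption, not real CO₂ data)
yearly_mean = 100 .+ 2.0 .* (years .- 2000)

fig = Figure()
ax = Axis(fig[1, 1], xlabel = "year", ylabel = "yearly mean")
scatter!(ax, years, yearly_mean)
fig
```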

Section 32.2 Plotting Discrete Data in Makie

In contrast, let’s look at discrete data. I have done a lot of research recently using sports data, and one project involved scoring in the National Basketball Association (NBA). Consider a season and look at the number of points every team has scored in each game. For the 2023-2024 season, we consider the scores of both the home and visiting teams; the first few games are shown in Subsection 32.2.3, where the data is loaded.

Subsection 32.2.1 Barplots in Makie

We saw barplots in Chapter 19, but didn’t explain how to generate them. A barplot is a good plot to use if there is a sequence of discrete data. For example, the probability distribution function for rolling two dice seen in Chapter 18 resulted in the distribution:
Table 32.1. PDF of the sum of two dice
\(k\) \(P(X=k)\)
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36
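The entries of this table can be checked by enumerating all 36 equally likely outcomes of the two dice. The following is a small sketch of that check; the names ways and pmf are chosen here for illustration.

```julia
# ways[k] counts how many (die 1, die 2) outcomes have sum k
ways = zeros(Int, 12)
for d1 in 1:6, d2 in 1:6
  ways[d1 + d2] += 1
end

# divide by the 36 outcomes using rationals to keep exact fractions
pmf = ways[2:12] .// 36

sum(pmf) == 1   # the probabilities add up to exactly 1
```

Note that Julia reduces rationals, so for example 2//36 is displayed as 1//18.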
A bar plot of this data is created with the barplot command. The call barplot(x,y) for vectors x and y plots a bar located at each value of x with height given by the corresponding value of y. The heights from the table above can be generated with h = [(6-abs(i-7))//36 for i = 2:12] and then
barplot(2:12,h)
which produces the plot
Figure 32.2.
and note that the horizontal axis has undesirable tick marks. We’ll fix this by building an Axis object and setting the xticks option. Additionally, I think the plot looks nicer with thin black borders around the bars and a title. We run the following:
fig = Figure()
ax = Axis(fig[1,1],xticks=2:2:12, title="PDF of the sum of two dice")
barplot!(ax, 2:12,h, strokewidth = 0.5)
fig
which produces the following plot:
Figure 32.3.
To make the bars run horizontally, use the option direction = :x in the barplot call. For example:
fig = Figure()
ax = Axis(fig[1,1],yticks=2:2:12, title="PDF of the sum of two dice")
barplot!(ax, 2:12,h, strokewidth = 0.5, direction = :x)
fig
and note that we have switched the tick marks to yticks. This produces the plot
Figure 32.4.

Subsection 32.2.2 Grouped Barplots in Makie

Another common way to display data is to compare one set to another. If each set would normally be displayed as a bar plot, then we can compare them with a grouped bar plot. A grouped bar plot can either be stacked (in which the bars are stacked above each other) or dodged (in which the bars are side by side). Let’s say we have the data
h1 = [2, 6, 8, 7]
h2 = [3, 8, 5, 9]
x = [1, 2, 3, 4]
where we want to plot both h1 and h2 on the same axes. A stacked bar plot can be made with
fig = barplot(vcat(x,x),vcat(h1,h2), 
 stack = vcat(x,x), 
 color=repeat(1:2,inner=4),
 strokewidth = 0.5)
where we have used array concatenation explained in Section 6.11 and the repeat method from Section 6.12. First, the stack option of barplot specifies that the data is to be stacked and which bars are stacked together. Without the color option, the bars are stacked, but can’t be distinguished (try it!). Also, again, typing this on multiple lines is mainly for clarity--the same results occur if this is written on a single line. The results are:
Figure 32.5.
I’m not a fan of the default colors that are used. One can choose any of the named colors. See the Makie documentation page on colors for more details. A common color grouping which works well for bar colors (that is, there is good contrast between the colors) is the built-in Makie.wong_colors(). The above plot can be changed with
colors = Makie.wong_colors()
fig = barplot(vcat(x,x),vcat(h1,h2), 
 stack = vcat(x,x), 
 color=colors[repeat(1:2,inner=4)],
 strokewidth = 0.5)
which results in
Figure 32.6.
If you want to specify the colors, simply replace the first line above with an array of named colors. For example,
colors = [:fuchsia, :steelblue]
fig = barplot(vcat(x,x),vcat(h1,h2), 
 stack = vcat(x,x), 
 color=colors[repeat(1:2,inner=4)],
 strokewidth = 0.5)
results in
Figure 32.7.
However, if the desire of plotting two sets of data is to compare them, often plotting them side by side is better. This can be accomplished with the dodge option as in
colors = Makie.wong_colors()
fig = barplot(vcat(x,x),vcat(h1,h2),
  dodge = repeat(1:2,inner=4),
  color=colors[repeat(1:2,inner=4)],
  strokewidth = 0.5)
resulting in
Figure 32.8.
Note that the dodge option determines the order of each group of bars for each x value. Although not shown, you can use more than two sets of bars for a group.
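As a sketch of a group with more than two sets of bars, the following adds a third data set. The vector h3 is hypothetical and made up for this illustration; everything else follows the same pattern as the two-set example, with the grouping vector now taking the values 1, 2 and 3.

```julia
using CairoMakie

h1 = [2, 6, 8, 7]
h2 = [3, 8, 5, 9]
h3 = [4, 5, 6, 3]    # hypothetical third data set (not from the text)
x = [1, 2, 3, 4]

# group index for each bar: four 1s, then four 2s, then four 3s
groups = repeat(1:3, inner = 4)
colors = Makie.wong_colors()

fig = Figure()
ax = Axis(fig[1, 1], xticks = 1:4)
barplot!(ax, vcat(x, x, x), vcat(h1, h2, h3),
  dodge = groups,
  color = colors[groups],
  strokewidth = 0.5)
fig
```

Building the grouping vector once and reusing it for both dodge and color keeps the two options consistent.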

Subsection 32.2.3 Example of Realistic Barplot in Makie

To plot every score (which is discrete), let’s consider a bar plot in which the height of a bar at a given score is the number of games with that score. To generate this, we load in the nba2024.csv file. Note that you will need to install the CSV and DataFrames packages.
using CSV, DataFrames
nba_scores = CSV.read("nba2024.csv", DataFrame)
This displays only some of the rows of the file (the remaining rows are omitted from the output). The top of the file looks like:
1319×5 DataFrame     1294 rows omitted
  Row   DATE        HOME_TEAM              HOME_SCORE  VISITOR_NAME        VISITOR_SCORE
        Date        String31               Int64       String31            Int64
  1     2023-10-24  Denver Nuggets         119         Los Angeles Lakers  107
  2     2023-10-24  Golden State Warriors  104         Phoenix Suns        108
  3     2023-10-25  Orlando Magic          116         Houston Rockets     86
We won’t go into what a DataFrame is, but in short it is a common data structure for working with data that comes in columns, where each column has a common type. These work well with spreadsheets. The details of a DataFrame are presented starting in Chapter 31. To plot the number of home games with a given score (also called the score distribution), we use the counts function in the StatsBase package (so install it) and run using StatsBase.
home_dist = counts(nba_scores.HOME_SCORE,70:160)
which returns a vector of the number of games with the score 70, 71, ..., 160. We can then plot the results with barplot using
barplot(70:160,home_dist)
which results in the plot
Figure 32.9.
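To see what counts returns, here is a small sketch on a made-up vector of four scores (an assumption for illustration, not taken from the NBA file). Each entry of the result corresponds to one value in the given range.

```julia
using StatsBase

scores = [70, 72, 72, 75]        # made-up scores for illustration
dist = counts(scores, 70:75)     # → [1, 0, 2, 0, 0, 1]
```

The first entry is the number of times 70 occurs, the second the number of times 71 occurs, and so on up to 75, so the result always has the same length as the range.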
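Another interesting plot places the bars for the home and visitor scores side by side. This is a little more involved, but possible with the barplot command. We first compute the distribution of visitor scores in the same way as the home scores, with visitor_dist = counts(nba_scores.VISITOR_SCORE,70:160). We then need to include all of the data together along with a grouping vector. The following is the code to do this: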
colors = Makie.wong_colors()
barplot(
  repeat(70:160,2),
  vcat(home_dist,visitor_dist),
  dodge = repeat(1:2,inner=91),
  color = repeat(collect(colors[1:2]),inner=91),
  strokewidth=0.25
)
where this is one command, split onto multiple lines for readability. The second line is the horizontal axis, which is the same as in the plot above except that we need to repeat it twice. The third line concatenates vertically (vcat) the two distributions. The fourth line explains how to group the data (dodge places the bars side by side, whereas stack would stack them vertically). The color attribute (line 5) sets the color for each bar--again, this needs to be a repeated vector. The result of this is
Figure 32.10.
More examples of plotting data can be found in Chapter 31 and other chapters in Part VII, which use larger datasets and investigate how to gain insights into data from visualization.