Using Databases

Chapter 34 Using Databases

We will see data analysis in the next part of this text. Often, when datasets get large, it is advantageous to store them in a database. We will show how to do this in a simple database system, SQLite, which is a SQL database that can be stored in a single file.

🔗

Section 34.1 Creating a SQLite Database

This section goes over how to create a SQLite database and let’s go ahead and do it with an example. If we have the forecast for some location over the next week with the following table:

🔗

Table 34.1. Weather Forecast December 2024

🔗

Date	High (°F)	Low (°F)	Precip (%)	Condition
2024-12-07	37	23	0	Partly Cloudy
2024-12-08	44	28	60	Snow
2024-12-09	44	32	75	Rain
2024-12-10	44	39	70	Rain
2024-12-11	58	40	80	Rain
2024-12-12	48	28	60	Rain
2024-12-13	37	25	0	Sunny
2024-12-14	40	24	0	Sunny
2024-12-15	45	30	0	Cloudy

For convenience, we can represent this data in Julia as a vector of named tuples.

🔗

forecast = [
  (date = "2024-12-07", high = 37, low = 23, precip = 0, cond = "Partly Cloudy"),
  (date = "2024-12-08", high = 44, low = 28, precip = 60, cond = "Snow"),
  (date = "2024-12-09", high = 44, low = 32, precip = 75, cond = "Rain"),
  (date = "2024-12-10", high = 44, low = 39, precip = 70, cond = "Rain"),
  (date = "2024-12-11", high = 58, low = 40, precip = 80, cond = "Rain"),
  (date = "2024-12-12", high = 48, low = 28, precip = 60, cond = "Rain"),
  (date = "2024-12-13", high = 37, low = 25, precip = 0, cond = "Sunny"),
  (date = "2024-12-14", high = 40, low = 24, precip = 0, cond = "Sunny"),
  (date = "2024-12-15", high = 45, low = 30, precip = 0, cond = "Cloudy"),
]

Even though this is a small dataset, we will use it to illustrate how to create a database and put data into it. First, we need to load the SQLite package.

🔗

using SQLite

To put this into a SQLite database, we first create one with

🔗

db = SQLite.DB("forecast.sqlite")

which resulted in the following:

🔗

SQLite.DB("forecast.sqlite")

🔗

showing that it succeeded and that there is a file called "forecast.sqlite" that was created (in the working directory).

🔗

This is just an empty database. To do anything with it, a table needs to be created in the database. We can do this with the following:

🔗

DBInterface.execute(db, 
  """CREATE TABLE IF NOT EXISTS FORECAST
  (DATE_ID  INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
   DATE     TEXT      NOT NULL,
   HI_TEMP  INTEGER   NOT NULL,
   LO_TEMP  INTEGER   NOT NULL,
   PRECIP   INTEGER   NOT NULL,
   COND     TEXT      NOT NULL
  );""")

Each item in the () list a column in the database, with the name of the column, type and if you insist that there are values (cannot be missing). Here’s the details of each column.

🔗

DATE_ID 🔗: This column is created as an index and can be thought of as the row in the database. The term PRIMARY KEY says that this is the key in the database and AUTOINCREMENT is used to say that this column will be created automatically. Both the key and the autoincrementing is often desirable for a database to help with both joining to other databases and indexing.
🔗
DATE 🔗: The date will be stored as a string (TEXT). Although other databases have a native Date type, SQLite does not and type TEXT is the best.
🔗
HI_TEMP 🔗: The forecast high temperature of the day as a integer.
🔗
LO_TEMP 🔗: The forecast low temperature of the day as a integer.
🔗
PRECIP 🔗: The forecast percent chance of precipitation (rain or snow) as an integer.
🔗
COND 🔗: The forecast conditions (Sunny, Partly Cloudy, etc.) as a string (TEXT).
🔗

🔗

Before going on, let’s see if we can take a look at this database using a command line tool called sqlite3

Add some information to get this on mac/windows/linux

. Go to a terminal, or shell (for example, in VSCode, View then Terminal), make sure you are in the same directory that forecast.sqlite was created and type in sqlite3 forecast.sqlite.

🔗

You will see the following (or similar depending on version):

🔗

SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite>

🔗

If we enter .tables, this will show all of the database tables in the database. We only have a single one so FORECAST is the only one listed. To list the columns as names and other information, enter pragma table_info(FORECAST); and this returns

🔗

1|DATE|TEXT|1||0
2|HI_TEMP|INTEGER|1||0
3|LO_TEMP|INTEGER|1||0
4|PRECIP|INTEGER|1||0
5|COND|TEXT|1||0

🔗

and the second column (separated by a |) is the name of the column and the third column is the datatype. The other three are related to indexing and skipped for now.

🔗

Section 34.2 Adding data to a database

Next, we want to write data to the database. We will use a SQL command to do this

🔗

for fc in forecast
  DBInterface.execute(
    db,
    """INSERT INTO FORECAST 
      (DATE, HI_TEMP, LO_TEMP, PRECIP, COND) VALUES (?, ?, ?, ?, ?)""", values(fc))
end

where we have used the INSERT command to do this. INSERT is more like an append command, but the general way this works is

🔗

INSERT INTO TABLE (COL1, COL2, ...) VALUE (VAL1, VAL2, ...);

🔗

where TABLE is the name of the table and then the values VAL1, VAL2, ... are placed in the columns COL1, COL2, .... If you created a column with NOT NULL then a value will need to be written, but if not, this column can be missing. Text columns need to be put in surrounded by single quotes (this is a SQL/SQLITE requirement not a Julia one). Also, we have split the INSERT command over three lines just for readability. This single command can be broken up in any way desired. Because we wrote this over 3 times, we use the multiline version of string that are fenced with """.

🔗

Section 34.3 Examining the Database in a Database

Next, let’s look the data in the database to ensure it was read in correctly. First, let’s use the sqlite3 command line program first. Again, go to a terminal or shell and enter sqlite3 forecast.sqlite. To examine the entire table, we enter

🔗

SELECT * FROM FORECAST;

and the result of this is:

🔗

1|2024-12-07|37|23|0|Partly Cloudy
2|2024-12-08|44|28|60|Snow
3|2024-12-09|44|32|75|Rain
4|2024-12-10|44|39|70|Rain
5|2024-12-11|58|40|80|Rain
6|2024-12-12|48|28|60|Rain
7|2024-12-13|37|25|0|Sunny
8|2024-12-14|40|24|0|Sunny
9|2024-12-15|45|30|0|Cloudy

🔗

We can do the same within Julia itself using the method DBInterface.execute, which is a general method to make database calls. However if we make a call that generates output, like

🔗

DBInterface.execute(db, "SELECT * FROM FORECAST")

🔗

the result is

🔗

SQLite.Query{false}(SQLite.Stmt(SQLite.DB("forecast.sqlite"), Base.RefValue{Ptr{SQLite.C.sqlite3_stmt}}(Ptr{SQLite.C.sqlite3_stmt} @0x00000001380f3b80), Dict{Int64, Any}()), Base.RefValue{Int32}(100), [:DATE_ID, :DATE, :HI_TEMP, :LO_TEMP, :PRECIP, :COND], Type[Union{Missing, Int64}, Union{Missing, String}, Union{Missing, Int64}, Union{Missing, Int64}, Union{Missing, Int64}, Union{Missing, String}], Dict(:COND   => 6, :HI_TEMP   => 3, :DATE   => 2, :PRECIP   => 5, :LO_TEMP   => 4, :DATE_ID   => 1), Base.RefValue{Int64}(0))

🔗

which seems to have information about the resulting table, like the column names and types, but nothing about the data itself. In short, this is an iterator (although it’s not very clear that it is) and passing this to a DataFrame will create one as in

🔗

DataFrame(DBInterface.execute(db, "SELECT * FROM FORECAST"))

now returns the expected table in a DataFrame:

🔗

9×6 DataFrame
Row  DATE_ID  DATE        HI_TEMP  LO_TEMP  PRECIP  COND
     Int64    String      Int64    Int64    Int64   String
1    1        2024-12-07  37       23       0       Partly Cloudy
2    2        2024-12-08  44       28       60      Snow
3    3        2024-12-09  44       32       75      Rain
4    4        2024-12-10  44       39       70      Rain
5    5        2024-12-11  58       40       80      Rain
6    6        2024-12-12  48       28       60      Rain
7    7        2024-12-13  37       25       0       Sunny
8    8        2024-12-14  40       24       0       Sunny
9    9        2024-12-15  45       30       0       Cloudy

🔗

Section 34.4 Larger Databases and Splitting over tables

The example above had just 9 lines in it and storing the data in a file and then importing the entire file is sufficient. When data gets larger, then databases are more helpful.

🔗

As an example, consider NBA game data. This data is available from various sources, but one source is Kaggle. If you don’t have an account on Kaggle, make one (it’s free) and then download the data from the link. The data comes as a zip file and you can unzip it to a directory called nba-data. There are a number of files in this directory, two of which we will use in this chapter: games.csv and teams.csv. The first file has information about each game played (date, teams, scores, etc.) and the second file has information about each team (name, city, abbreviation, etc.). There are over 26,000 games in the games.csv file alone.

🔗

The next steps assume that the zip file has been unzipped to a directory called nba-data in the current working directory and the two files games.csv and teams.csv are in this directory.

🔗

First, we create a new database file called nba.sqlite with

🔗

db = SQLite.DB("nba.sqlite")

Next, we load in the two CSV files into two tables called GAMES and TEAMS with

🔗

SQLite.load!(CSV.File("nba-data/games.csv", types=Dict(:GAME_DATE_EST => String)), db, "GAMES")
SQLite.load!(CSV.File("nba-data/teams.csv"), db, "TEAMS")

Note that the GAME_DATE_EST column in the games.csv file is a date, but SQLite does not have a date type, so we read it in as a string. This is done by passing a Dict to the types keyword argument of CSV.File.

🔗

Note 34.2.

If you work a lot with sqlite database or intend to, it is recommended to get a database browser, which will load in the database from a file and display it a bit like a spreadsheet. I have used the programs SQLPro for SQLite, Base and DB Browser for SQLite. The first two are Mac apps and have limited free versions, but the paid ones are more feature rich. The last one is free and open source, but not nearly as polished as the other ones. Once downloaded and installed, opening each of them allow one to open the database file and then will list all of the tables and the data inside.

🔗

Let’s just take a look at the games from the database now. The following will load all of the games:

🔗

DataFrame(DBInterface.execute(db, "SELECT * FROM GAMES"))

The top of the result is:

🔗

26651×21 DataFrame26626 rows omitted
Row   GAME_DATE_EST   GAME_ID   GAME_STATUS_TEXT   HOME_TEAM_ID   VISITOR_TEAM_ID   SEASON   TEAM_ID_home   PTS_home   FG_PCT_home   FT_PCT_home   FG3_PCT_home   AST_home   REB_home   TEAM_ID_away   PTS_away   FG_PCT_away   FT_PCT_away   FG3_PCT_away   AST_away   REB_away   HOME_TEAM_WINS
String   Int64   String   Int64   Int64   Int64   Int64   Float64?   Float64?   Float64?   Float64?   Float64?   Float64?   Int64   Float64?   Float64?   Float64?   Float64?   Float64?   Float64?   Int64
1   2022-12-22   22200477   Final   1610612740   1610612759   2022   1610612740   126.0   0.484   0.926   0.382   25.0   46.0   1610612759   117.0   0.478   0.815   0.321   23.0   44.0   1
2   2022-12-22   22200478   Final   1610612762   1610612764   2022   1610612762   120.0   0.488   0.952   0.457   16.0   40.0   1610612764   112.0   0.561   0.765   0.333   20.0   37.0   1
3   2022-12-21   22200466   Final   1610612739   1610612749   2022   1610612739   114.0   0.482   0.786   0.313   22.0   37.0   1610612749   106.0   0.47   0.682   0.433   20.0   46.0   1

🔗

You may notice that it’s hard to tell who played and who won each of the games. Instead of home and visitor teams, there is the HOME_TEAM_ID and VISITOR_TEAM_ID. These are numeric codes for each team. To find out which team is which, we can look at the TEAMS table.

🔗

Check Your Understanding 34.3.

Print out the contents of the TEAMS table from the database.

🔗

Prev Top Next