We will see data analysis in the next part of this text. Often, when datasets get large, it is advantageous to store them in a database. We will show how to do this in a simple database system, SQLite, which is a SQL database that can be stored in a single file.
This section goes over how to create a SQLite database and let’s go ahead and do it with an example. If we have the forecast for some location over the next week with the following table:
Even though this is a small dataset, we will use it to illustrate how to create a database and put data into it. First, we need to load the SQLite package.
DBInterface.execute(db,
"""CREATE TABLE IF NOT EXISTS FORECAST
(DATE_ID INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
DATE TEXT NOT NULL,
HI_TEMP INTEGER NOT NULL,
LO_TEMP INTEGER NOT NULL,
PRECIP INTEGER NOT NULL,
COND TEXT NOT NULL
);""")
Each item in the () list a column in the database, with the name of the column, type and if you insist that there are values (cannot be missing). Here’s the details of each column.
This column is created as an index and can be thought of as the row in the database. The term PRIMARY KEY says that this is the key in the database and AUTOINCREMENT is used to say that this column will be created automatically. Both the key and the autoincrementing is often desirable for a database to help with both joining to other databases and indexing.
Before going on, let’s see if we can take a look at this database using a command line tool called sqlite3 1
Add some information to get this on mac/windows/linux
. Go to a terminal, or shell (for example, in VSCode, View then Terminal), make sure you are in the same directory that forecast.sqlite was created and type in sqlite3 forecast.sqlite.
If we enter .tables, this will show all of the database tables in the database. We only have a single one so FORECAST is the only one listed. To list the columns as names and other information, enter pragma table_info(FORECAST); and this returns
and the second column (separated by a |) is the name of the column and the third column is the datatype. The other three are related to indexing and skipped for now.
where TABLE is the name of the table and then the values VAL1, VAL2, ... are placed in the columns COL1, COL2, .... If you created a column with NOT NULL then a value will need to be written, but if not, this column can be missing. Text columns need to be put in surrounded by single quotes (this is a SQL/SQLITE requirement not a Julia one). Also, we have split the INSERT command over three lines just for readability. This single command can be broken up in any way desired. Because we wrote this over 3 times, we use the multiline version of string that are fenced with """.
Next, let’s look the data in the database to ensure it was read in correctly. First, let’s use the sqlite3 command line program first. Again, go to a terminal or shell and enter sqlite3 forecast.sqlite. To examine the entire table, we enter
We can do the same within Julia itself using the method DBInterface.execute, which is a general method to make database calls. However if we make a call that generates output, like
which seems to have information about the resulting table, like the column names and types, but nothing about the data itself. In short, this is an iterator (although it’s not very clear that it is) and passing this to a DataFrame will create one as in
Section34.4Larger Databases and Splitting over tables
The example above had just 9 lines in it and storing the data in a file and then importing the entire file is sufficient. When data gets larger, then databases are more helpful.
As an example, consider NBA game data. This data is available from various sources, but one source is Kaggle. If you don’t have an account on Kaggle, make one (it’s free) and then download the data from the link. The data comes as a zip file and you can unzip it to a directory called nba-data. There are a number of files in this directory, two of which we will use in this chapter: games.csv and teams.csv. The first file has information about each game played (date, teams, scores, etc.) and the second file has information about each team (name, city, abbreviation, etc.). There are over 26,000 games in the games.csv file alone.
The next steps assume that the zip file has been unzipped to a directory called nba-data in the current working directory and the two files games.csv and teams.csv are in this directory.
Note that the GAME_DATE_EST column in the games.csv file is a date, but SQLite does not have a date type, so we read it in as a string. This is done by passing a Dict to the types keyword argument of CSV.File.
If you work a lot with sqlite database or intend to, it is recommended to get a database browser, which will load in the database from a file and display it a bit like a spreadsheet. I have used the programs SQLPro for SQLite, Base and DB Browser for SQLite. The first two are Mac apps and have limited free versions, but the paid ones are more feature rich. The last one is free and open source, but not nearly as polished as the other ones. Once downloaded and installed, opening each of them allow one to open the database file and then will list all of the tables and the data inside.
You may notice that it’s hard to tell who played and who won each of the games. Instead of home and visitor teams, there is the HOME_TEAM_ID and VISITOR_TEAM_ID. These are numeric codes for each team. To find out which team is which, we can look at the TEAMS table.