Missing Data

Chapter 33 Missing Data

Objectives

Understand the Missing datatype.
🔗

🔗
Perform operations with missing data and determine how to find reasonable answers with them.
🔗

🔗
Handle datasets with missing data.
🔗

🔗

The Missing datatype was introduced in Section 21.5 as a special data type that has only a single value missing. This generally isn’t a very exciting datatype, but is however important in that many dataset have missing data and want to handle such values easily.

🔗

Section 33.1 Operations with Missing data

Before getting into datasets with missing data, let’s look at a simpler example. If we execute a expression like 1+missing, we get the result missing. In fact all of the standard mathematical functions and operations return missing if any of the arguments are missing. The primary purpose of the Missing datatype.

🔗

This corresponds to other Julia functions as well. Consider v=[1,2,missing,4,5,6,7]. If we want add these values with sum(v) then the result is missing since the sum of any number with missing results in missing.

🔗

One way to handle this is to delete the missing value and sum the remainders. We can use the deleteat! function (see Section 6.9). The following will find the sum of the non-missing values:

🔗

sum(deleteat!(v,3))

which returns 25. This isn’t ideal. First, the deleteat! function actually changes the array. We may want to keep the missing value so we know where they occur. Secondly, we need to know the indices of the missing values. Both of these can taken care of with the filter command.

🔗

If we have the vector w = [1,2,missing, 4,5,6,7,missing,9,10,11,missing], then apply the filter command with

🔗

filter(x-> !ismissing(x), w)

and notice that the result is [1,2, 4,5,6,7,9,10,11], which filters out the missing values. The ismissing function is used here and just returns true if the argument has the missing value. Then the sum function can be applied to this vector.

🔗

Alternatively, there is a skipmissing function. Applying this to w with skipmissing(w) results in

🔗

skipmissing(Union{Missing, Int64}[1, 2, missing, 4, 5, 6, 7, missing, 9, 10, 11, missing])

🔗

which doesn’t seem to do anything. However, the result is an iterator, that was covered in Chapter 22 and is a abstract version of an vector. The sum function can be applied directly to this as

🔗

sum(skipmissing(w))

and the result is 55. Alternatively, if you want the array of non-missing values, collect(skipmissing(w)) results in

🔗

[1, 2, 4, 5, 6, 7, 9, 10, 11]

🔗

Subsection 33.1.1 Why skipmissing is an iterator

If you just wanted the sum of the non-missing values in w, the function sum(skipmissing(w)) is probably all you need and you can ignore this section. However, let’s deep dive into this a bit because this shows an example of how Julia is designed for speed and efficiency.

🔗

First of all skipmissing can be applied to any other iterator such as vectors, arrays, dictionaries and more importantly as we will see below DataFrames. sum(skipmissing([1 missing 3; missing 5 6])) returns 15 as expected. Additionally, consider the dictionary:

🔗

D = Dict(
  "A" => 3,
  "B" => 11,
  "C" => missing,
  "D" => 9
)

if we want the values of D, we can enter values(D) which returns [11 missing 3 9]

Recall that the keys and values of a dictionary are not returned in order.

and if the sum of the values can be found with sum(skipmissing(values(D))) resulting in 23.

🔗

The other reason to make skipmissing an iterator is for efficiency. Let’s look at another example. The vector C will be a length of 1 million with the following definition:

🔗

C::Vector{Union{Missing,Int64}} = collect(1:1_000_000);

where we have specifically stated that C is either missing or Int64 because we want to fill 100 random values will missing. We’ll do this with:

🔗

for _ = 1:100
  i = rand(1:1_000_000)
  C[i] = missing
end

The size of C in memory is found with sizeof(C) to be 8000000, which makes sense in that each value is 8 bytes (64-bits).

🔗

If we defined C2 = skipmissing(C) and then determine its size with sizeof(C), we get 8. The information in C2 is a reference to C and takes only 8 bytes of memory.

🔗

Section 33.2 Missing Data in a Dataset

Let’s revisit the dataset called simpsons from the previous section which we entered directly with

🔗

simpsons = DataFrame(
  id = 1:2:13,
  name = ["Homer", "Marge", "Lisa", "Bart", "Maggie", "Apu", "Moe"],
  age = [45, 42, 8, 10, 1, 38, 59],
  salary = [50000, 25000, 10000, missing, missing, 45000, 3000],
  favorite_food = ["pork chops", "casserole", "salad", "hamburger", missing, "saag paneer", "peanuts"]
)

Notice that there are a number of missing values have been put in and recall that we first saw the Missing type in Section 21.5. We’ll notice something different about the data types. If we look at the salary column:

🔗

typeof(simpsons[!,:salary])

🔗

we see that the type is Vector{Union{Missing, Int64}} which means that the type of elements in the array are Union{Missing,Int64} which is julia’s way of saying that the type can be either Missing or Int64. Note that the column header for the last two headers have a "?" after the type. This is a shorthand for the same type.

🔗

One of the other commands that we saw in Chapter 31 was the describe function. In this case, describe(simpsons) results in

🔗

5×7 DataFrame
Row      variable  mean      min     median    max     nmissing   eltype
         Symbol    Union…    Any     Union…    Any     Int64   Type
1        id            7.0     1       7.0      13       0   Int64
2        name                Apu               Moe       0  String
3        age          29.0     1      38.0      59       0   Int64
4        salary    26600.0  3000   25000.0   50000       2   Union{Missing, Int64}
5        favorite_food           casserole      salad    1   Union{Missing, String}

🔗

Again, we see the types of the last two columns are possibly missing and the column labelled nmissing lists the number of missing values.

🔗

If we perform calculations on the DataFrame itself, consider

🔗

mean(simpsons[!,:salary])

🔗

this returns missing, which makes sense from the above discussion that nearly every operation with missing returns missing.

🔗

Instead, we may want to ignore the missing values, so we can use the skipmissing function.

🔗

s = skipmissing(simpsons[!,:salary])

🔗

and then

🔗

mean(s)

🔗

then it finds the mean of the non-missing values and returns 26600.0.

🔗

There is a way to reduce the DataFrame (similar to subset) which drops rows with missing data. For example

🔗

dropmissing(simpsons)

and the result is:

🔗

5×5 DataFrame
Row   id  name       age   salary  favorite_food
     Int64  String  Int64  Int64    String
1      1    Homer     45   50000    pork chops
2      3    Marge     42   25000    casserole
3      5    Lisa       8   10000    salad
4      11   Apu       38   45000    saag paneer
5      13   Moe       59    3000    peanuts

🔗

and note that the DataFrame has gotten smaller, but also the column types are missing the trailing ? indicating that they do not contain missing values.

🔗

One can also specify only to drop rows with missing values in a particular column, for example:

🔗

dropmissing(simpsons, :favorite_food)

results in

🔗

6×5 DataFrame
Row  id     name    age  salary  favorite_food
    Int64  String  Int64  Int64?  String
1     1     Homer    45  50000     pork chops
2     3     Marge    42  25000     casserole
3     5     Lisa      8  10000     salad
4     7     Bart     10  missing   hamburger
5    11     Apu      38  45000     saag paneer
6    13     Moe      59   3000     peanuts

🔗

and 1 row has been dropped, however there is still a missing value in the salary column.

🔗

Section 33.3 Handling Missing Data in a CSV File

This file contains a larger dataset with Simpson’s characters.

🔗

Prev Top Next