Section 33.2 Missing Data in a Dataset
Let’s revisit the dataset called
simpsons from the previous section which we entered directly with
using DataFrames, Statistics  # Statistics provides mean, used below

simpsons = DataFrame(
  id = 1:2:13,
  name = ["Homer", "Marge", "Lisa", "Bart", "Maggie", "Apu", "Moe"],
  age = [45, 42, 8, 10, 1, 38, 59],
  salary = [50000, 25000, 10000, missing, missing, 45000, 3000],
  favorite_food = ["pork chops", "casserole", "salad", "hamburger", missing, "saag paneer", "peanuts"]
)
Notice that a number of missing values have been entered, and recall that we first saw the
Missing type in
Section 21.5. These values change the data types of the columns. If we look at the salary column:
typeof(simpsons[!,:salary])
we see that the type is
Vector{Union{Missing, Int64}}, which means that each element of the array has type
Union{Missing, Int64}. This is Julia’s way of saying that an element can be either
Missing or
Int64. Note that when the DataFrame is displayed, the types under the last two column headers have a "?" appended. This is shorthand for the same Union type.
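The element type can also be inspected directly. The following is a minimal sketch, assuming a 64-bit machine (so that integer literals are Int64) and using a standalone vector in place of the DataFrame column. The variable names are illustrative only; nonmissingtype is a Base function that strips Missing from a Union.

```julia
# A standalone vector with the same contents as the salary column.
salary = [50000, 25000, 10000, missing, missing, 45000, 3000]

# Julia infers a Union element type because the literal mixes integers and missing.
T = eltype(salary)              # Union{Missing, Int64} on a 64-bit machine

# Missing is one of the types allowed by the Union ...
allows_missing = Missing <: T   # true

# ... and nonmissingtype recovers the underlying value type.
V = nonmissingtype(T)           # Int64 on a 64-bit machine
```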
One of the other commands that we saw in
Chapter 31 was the
describe function. In this case,
describe(simpsons) results in
5×7 DataFrame
 Row  variable       mean     min        median   max    nmissing  eltype
      Symbol         Union…   Any        Union…   Any    Int64     Type
   1  id             7.0      1          7.0      13     0         Int64
   2  name                    Apu                 Moe    0         String
   3  age            29.0     1          38.0     59     0         Int64
   4  salary         26600.0  3000       25000.0  50000  2         Union{Missing, Int64}
   5  favorite_food           casserole           salad  1         Union{Missing, String}
Again, we see that the element types of the last two columns allow missing values, and the column labelled
nmissing lists the number of missing values in each column.
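The nmissing counts can be cross-checked by hand. The sketch below assumes the DataFrames package is installed; it counts missing entries column by column with count and ismissing, and the names nmiss, missing_counts and short are illustrative only. The last line assumes that describe accepts the statistics to report as extra symbol arguments.

```julia
using DataFrames

simpsons = DataFrame(
  id = 1:2:13,
  name = ["Homer", "Marge", "Lisa", "Bart", "Maggie", "Apu", "Moe"],
  age = [45, 42, 8, 10, 1, 38, 59],
  salary = [50000, 25000, 10000, missing, missing, 45000, 3000],
  favorite_food = ["pork chops", "casserole", "salad", "hamburger", missing, "saag paneer", "peanuts"]
)

# Count the missing entries in each column; this mirrors the nmissing column.
nmiss = [count(ismissing, col) for col in eachcol(simpsons)]

# Pair each column name with its count for readability.
missing_counts = names(simpsons) .=> nmiss

# Restrict describe to the two statistics of interest here.
short = describe(simpsons, :nmissing, :eltype)
```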
Now suppose we perform a calculation on a column of the DataFrame, for example
mean(simpsons[!,:salary])
This returns
missing, which is consistent with the earlier discussion: nearly every operation involving missing returns missing.
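As a small illustration of this propagation, the sketch below uses a short made-up vector x rather than the salary column. It also shows why ismissing, rather than ==, is the way to test for a missing value.

```julia
using Statistics

x = [1, 2, missing]

a = missing + 1           # arithmetic with missing gives missing
b = sum(x)                # reductions propagate it as well
c = mean(x)               # so the mean is missing too
d = (missing == missing)  # even an equality comparison yields missing, not true

# Because == propagates missing, test for it with ismissing instead.
all_missing = all(ismissing, (a, b, c, d))   # true
```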
Instead, we may want to ignore the missing values, which we can do with the
skipmissing function:
s = skipmissing(simpsons[!,:salary])
mean(s)
This finds the mean of the non-missing values and returns
26600.0.
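To see what skipmissing actually passes along, the following sketch works on a standalone copy of the salary values. The last two lines are not from the text above: they use coalesce, a Base function, to substitute 0 for missing, purely to contrast skipping values with filling them in. All variable names are illustrative.

```julia
using Statistics

salary = [50000, 25000, 10000, missing, missing, 45000, 3000]

# skipmissing returns a lazy iterator; collect materializes the remaining values.
s = skipmissing(salary)
present = collect(s)                  # the 5 non-missing salaries

m = mean(s)                           # 26600.0, averaged over 5 values
n_skipped = count(ismissing, salary)  # 2 values were ignored

# A different policy: replace missing with 0 instead of skipping it.
# The average is now taken over all 7 entries, so the result is smaller.
filled = coalesce.(salary, 0)
m_filled = mean(filled)               # 19000.0
```

The two results differ because skipping changes the number of values being averaged, while filling keeps all seven; which is appropriate depends on what a missing salary is taken to mean.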
There is also a way to reduce the DataFrame (similar to
subset) by dropping the rows that contain missing data. For example,
dropmissing(simpsons)
returns
5×5 DataFrame
 Row  id     name    age    salary  favorite_food
      Int64  String  Int64  Int64   String
   1  1      Homer   45     50000   pork chops
   2  3      Marge   42     25000   casserole
   3  5      Lisa    8      10000   salad
   4  11     Apu     38     45000   saag paneer
   5  13     Moe     59     3000    peanuts
Note that the DataFrame has gotten smaller, and that the column types no longer have the trailing ?, indicating that the columns no longer allow missing values.
One can also drop only the rows that have a missing value in a particular column. For example:
dropmissing(simpsons, :favorite_food)
6×5 DataFrame
 Row  id     name    age    salary   favorite_food
      Int64  String  Int64  Int64?   String
   1  1      Homer   45     50000    pork chops
   2  3      Marge   42     25000    casserole
   3  5      Lisa    8      10000    salad
   4  7      Bart    10     missing  hamburger
   5  11     Apu     38     45000    saag paneer
   6  13     Moe     59     3000     peanuts
Here 1 row has been dropped; however, there is still a
missing value in the
salary column.
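The two forms of dropmissing can be compared side by side. This is a sketch assuming the DataFrames package is installed and a 64-bit machine; the names complete, food_only and is_complete are illustrative. It also uses completecases, a DataFrames function that flags which rows have no missing values, to show which rows the first call keeps.

```julia
using DataFrames

simpsons = DataFrame(
  id = 1:2:13,
  name = ["Homer", "Marge", "Lisa", "Bart", "Maggie", "Apu", "Moe"],
  age = [45, 42, 8, 10, 1, 38, 59],
  salary = [50000, 25000, 10000, missing, missing, 45000, 3000],
  favorite_food = ["pork chops", "casserole", "salad", "hamburger", missing, "saag paneer", "peanuts"]
)

# Drop every row that has a missing value in any column.
complete = dropmissing(simpsons)
n_complete = nrow(complete)                # 5
salary_type = eltype(complete.salary)      # Int64: Missing is no longer allowed

# Drop only the rows where favorite_food is missing.
food_only = dropmissing(simpsons, :favorite_food)
n_food = nrow(food_only)                   # 6
food_salary_type = eltype(food_only.salary)  # still Union{Missing, Int64}

# One Bool per row: true where the row has no missing values.
is_complete = completecases(simpsons)

# The original DataFrame is unchanged by dropmissing.
n_original = nrow(simpsons)                # 7
```

Because dropmissing returns a new DataFrame, simpsons itself still has all 7 rows afterwards.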