23 EDA: Handling Missing Data
We can now move on to a very important aspect of data preparation and transformation: how to deal with missing data. By missing data we mean values that are unrecorded, unknown, or unspecified in a dataset. We saw an example of this when we looked at the unit on tidying data. Here is the tidy weather dataset again:
## # A tibble: 22 x 35
## id year month element d1 d2 d3
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 MX17004 2010 1 tmax NA NA NA
## 2 MX17004 2010 1 tmin NA NA NA
## 3 MX17004 2010 2 tmax NA 27.3 24.1
## 4 MX17004 2010 2 tmin NA 14.4 14.4
## 5 MX17004 2010 3 tmax NA NA NA
## 6 MX17004 2010 3 tmin NA NA NA
## 7 MX17004 2010 4 tmax NA NA NA
## 8 MX17004 2010 4 tmin NA NA NA
## 9 MX17004 2010 5 tmax NA NA NA
## 10 MX17004 2010 5 tmin NA NA NA
## # … with 12 more rows, and 28 more variables:
## # d4 <dbl>, d5 <dbl>, d6 <dbl>, d7 <dbl>,
## # d8 <dbl>, d9 <lgl>, d10 <dbl>, d11 <dbl>,
## # d12 <lgl>, d13 <dbl>, d14 <dbl>, d15 <dbl>,
## # d16 <dbl>, d17 <dbl>, d18 <lgl>, d19 <lgl>,
## # d20 <lgl>, d21 <lgl>, d22 <lgl>, d23 <dbl>,
## # d24 <lgl>, d25 <dbl>, d26 <dbl>, d27 <dbl>,
## # d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>
And the result of tidying this dataset:
## # A tibble: 341 x 6
## id year month day tmax tmin
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 MX17004 2010 1 d1 NA NA
## 2 MX17004 2010 1 d10 NA NA
## 3 MX17004 2010 1 d11 NA NA
## 4 MX17004 2010 1 d12 NA NA
## 5 MX17004 2010 1 d13 NA NA
## 6 MX17004 2010 1 d14 NA NA
## 7 MX17004 2010 1 d15 NA NA
## 8 MX17004 2010 1 d16 NA NA
## 9 MX17004 2010 1 d17 NA NA
## 10 MX17004 2010 1 d18 NA NA
## # … with 331 more rows
In this dataset, temperature observations coded as NA are considered missing. We can imagine a few reasons why measurements would be missing: (a) the measurement failed on a specific day for a specific weather station, (b) certain stations only measure temperatures on specific days of the month, or (c) the measurement fails when the temperature is too high or too low. Knowing which of these applies can change how we approach the missing data. As you can see, how to treat missing data depends heavily on how the data was obtained, and the more you know about a dataset, the better the decisions you can make.
In general, the central question with missing data is: should we remove observations with missing values, or should we impute the missing values? Indeed, can we do anything at all with a dataset that is missing data? Answering these questions requires us to think about why the data is missing.
23.1 Mechanisms of missing data
For this discussion, let’s assume we have an attribute \(y\) that contains missing data, a binary attribute \(r\) that encodes whether the corresponding observation in \(y\) is missing, and other attributes \(x\) in our dataset.
We will also make statements like “depends” or “does not depend”, e.g., the value of \(r_i\) does not depend on the value of \(y_i\). Until we formalize this concept further, you can take this to mean that the properties of the distribution of \(r\) do not change based on the values of \(y\).
Data missing completely at random (MCAR): missingness does not depend on any values of the missing or measured data. That is, missingness \(r_i\) depends neither on the (unobserved) value \(y_i\) nor on the observed values \(x_i\). In this case, entities with missing data can safely be removed from the analysis. Imputation can also be performed; see more below. In our weather example, this would be case (a): stations failed for no discernible reason.
Data missing at random (MAR): missingness \(r_i\) does not depend on the value of \(y_i\), but may depend on the value of \(x_i\). Here, removing data can bias the analysis, since you would drop values of \(x\) based on missingness and potentially change the distribution of \(x\). Imputation can be done as well; see more below. In our weather example, this would be case (b): measurements are not taken on specific days of the month (where “day of the month” serves the role of \(x\)).
Data not missing at random (NMAR): missingness \(r_i\) depends on \(y_i\). This is the most pernicious case, and usually means that we want to go back to our collaborator and tell them that we are in a bind. Removing or imputing data as discussed below is not appropriate in this case, and appropriate methods for dealing with it are beyond the scope of this discussion. This is a good resource: https://www.wiley.com/en-us/Statistical+Data+Cleaning+with+Applications+in+R-p-9781118897157 (Ch. 10). In our weather example, this would be case (c): measurements fail when the temperature is too hot or too cold.
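To make the MAR case concrete, here is a small simulation sketch (all names and numbers are illustrative): \(y\) depends on a fully observed attribute \(x\), and missingness in \(y\) depends only on \(x\). Dropping incomplete rows then shifts the distribution of \(x\):

```r
set.seed(1)
n <- 10000
x <- rnorm(n)            # fully observed attribute
y <- x + rnorm(n)        # attribute that will have missing values

# MAR: probability that y is missing depends only on x (larger x, more missing)
r <- runif(n) < plogis(2 * x)
y_obs <- ifelse(r, NA, y)

mean(x)                  # close to 0 by construction
mean(x[!is.na(y_obs)])   # noticeably below 0: complete cases skew toward small x
```

The complete-case mean of \(x\) is biased even though missingness never looked at \(y\) itself, which is exactly why dropping rows under MAR is dangerous.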
So, in general, the first step when dealing with missing data is to understand why and how the data may be missing; that is, talk to your collaborator or the person who created the dataset.
23.2 Handling missing data
23.2.1 Removing missing data
Once you know that data is MCAR and only a relatively small fraction of observations have missing values, it may be safe to remove those observations.
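One way to do this removal is with `tidyr::drop_na`. Here is a minimal, self-contained sketch; the small tibble is a toy stand-in for the tidy weather data (the output below shows the result on the actual dataset):

```r
library(dplyr)
library(tidyr)

# toy stand-in for the tidy weather data
tidy_weather <- tibble(
  id    = "MX17004", year = 2010,
  month = c(1, 2, 2),
  day   = c("d1", "d2", "d3"),
  tmax  = c(NA, 27.3, 24.1),
  tmin  = c(NA, 14.4, 14.4)
)

# drop every row with a missing tmax or tmin
complete_weather <- tidy_weather %>% drop_na(tmax, tmin)
```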
## # A tibble: 33 x 6
## id year month day tmax tmin
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 MX17004 2010 1 d30 27.8 14.5
## 2 MX17004 2010 2 d11 29.7 13.4
## 3 MX17004 2010 2 d2 27.3 14.4
## 4 MX17004 2010 2 d23 29.9 10.7
## 5 MX17004 2010 2 d3 24.1 14.4
## 6 MX17004 2010 3 d10 34.5 16.8
## 7 MX17004 2010 3 d16 31.1 17.6
## 8 MX17004 2010 3 d5 32.1 14.2
## 9 MX17004 2010 4 d27 36.3 16.7
## 10 MX17004 2010 5 d27 33.2 18.2
## # … with 23 more rows
23.2.2 Encoding as missing
In the MCAR or MAR case for categorical attributes \(y\), a useful approach is to encode the fact that a value is missing as a new category and include that in subsequent analysis of attribute \(y\).
tb <- read_csv(file.path("data", "tb.csv"))
tidy_tb <- tb %>%
  gather(demo, n, -iso2, -year) %>%          # gather demographic columns into key-value pairs
  separate(demo, c("sex", "age"), sep = 1)   # split, e.g., "m04" into sex "m" and age "04"
tidy_tb %>%
  tidyr::replace_na(list(iso2 = "missing"))  # treat a missing country code as its own category
## # A tibble: 115,380 x 5
## iso2 year sex age n
## <chr> <dbl> <chr> <chr> <dbl>
## 1 AD 1989 m 04 NA
## 2 AD 1990 m 04 NA
## 3 AD 1991 m 04 NA
## 4 AD 1992 m 04 NA
## 5 AD 1993 m 04 NA
## 6 AD 1994 m 04 NA
## 7 AD 1996 m 04 NA
## 8 AD 1997 m 04 NA
## 9 AD 1998 m 04 NA
## 10 AD 1999 m 04 NA
## # … with 115,370 more rows
23.2.3 Imputation
23.2.3.1 MCAR (also for MAR, but this is not ideal)
In this case we can use a simple imputation method for \(y\): for numeric attributes, we replace missing values in \(y\) with the mean of the non-missing values of \(y\).
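As a sketch, mean imputation for a numeric column can be written with `mutate` and `ifelse`; the tibble here is a toy example (for the flights data shown below, the same pattern would apply to a column such as `dep_delay`):

```r
library(dplyr)

df <- tibble(y = c(1, 2, NA, 4, NA))

# replace missing values of y with the mean of the non-missing values
df_imputed <- df %>%
  mutate(y = ifelse(is.na(y), mean(y, na.rm = TRUE), y))
```

Note that the mean of the column is unchanged by this operation, a point we return to at the end of the chapter.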
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## 7 2013 1 1 555 600
## 8 2013 1 1 557 600
## 9 2013 1 1 557 600
## 10 2013 1 1 558 600
## # … with 336,766 more rows, and 14 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
For categorical attributes \(y\), we replace missing values with the most common category among the non-missing values of \(y\).
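A sketch of this mode imputation for a categorical column (toy data; all names are illustrative):

```r
library(dplyr)

df <- tibble(color = c("red", "blue", "red", NA, "red", NA))

# find the most common non-missing category
mode_val <- df %>%
  filter(!is.na(color)) %>%
  count(color, sort = TRUE) %>%
  slice(1) %>%
  pull(color)

# replace missing values with that category
df_imputed <- df %>%
  mutate(color = coalesce(color, mode_val))
```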
23.2.3.2 MAR
In this case we use a more complex method: we replace missing values of attribute \(y\) with values predicted from other attributes \(x\) when the attributes are related (we will see linear regression, using the lm and predict functions, later on).
# fit a linear model predicting departure delay from origin airport
dep_delay_fit <- flights %>% lm(dep_delay~origin, data=.)

# use average delay conditioned on origin airport to fill in missing delays
flights %>%
  modelr::add_predictions(dep_delay_fit, var="pred_delay") %>%
  mutate(dep_delay_fixed =
           ifelse(!is.na(dep_delay), dep_delay, pred_delay)) %>%
  select(origin, dest, dep_delay, dep_delay_fixed) %>%
  filter(is.na(dep_delay))
## # A tibble: 8,255 x 4
## origin dest dep_delay dep_delay_fixed
## <chr> <chr> <dbl> <dbl>
## 1 EWR RDU NA 15.1
## 2 LGA DFW NA 10.3
## 3 LGA MIA NA 10.3
## 4 JFK FLL NA 12.1
## 5 EWR CVG NA 15.1
## 6 EWR PIT NA 15.1
## 7 EWR MHT NA 15.1
## 8 EWR ATL NA 15.1
## 9 EWR IND NA 15.1
## 10 JFK LAX NA 12.1
## # … with 8,245 more rows
For categorical attributes we use a different kind of regression, more appropriate for categorical outcomes (logistic regression; again, we will see this later on).
For both imputation methods, a common approach is to add an indicator variable recording whether each value was imputed:
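The output below includes such an indicator (dep_delay_missing). The pattern is to record `is.na(y)` before overwriting `y`; a self-contained sketch on toy data:

```r
library(dplyr)

df <- tibble(y = c(1, 2, NA, 4))

df_flagged <- df %>%
  mutate(
    y_missing = is.na(y),  # record missingness before imputing
    y = ifelse(is.na(y), mean(y, na.rm = TRUE), y)
  )
```

Within `mutate`, expressions are evaluated in order, so `y_missing` is computed from the original `y` before the imputation overwrites it.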
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## 7 2013 1 1 555 600
## 8 2013 1 1 557 600
## 9 2013 1 1 557 600
## 10 2013 1 1 558 600
## # … with 336,766 more rows, and 15 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, dep_delay_missing <lgl>
23.3 Implications of imputation
Imputation has some effects that can impact analysis.
The central tendency of data is retained. For example, if we impute missing data using the mean of a numeric variable, the mean after imputation will not change. This is a good reason to impute based on estimates of central tendency.
The spread of the data will change. After mean imputation, the spread of the data will be smaller than the spread computed from the non-missing values alone. This can be problematic, as underestimating the spread of the data can yield over-confident inferences in downstream analysis.
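Both effects are easy to verify directly; a base-R sketch with a toy vector:

```r
y <- c(10, 12, NA, 14, NA, 16)

# mean imputation
y_imputed <- ifelse(is.na(y), mean(y, na.rm = TRUE), y)

mean(y_imputed) == mean(y, na.rm = TRUE)  # TRUE: central tendency preserved
sd(y_imputed) < sd(y, na.rm = TRUE)       # TRUE: spread is underestimated
```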
We will not address these issues directly in later chapters, but you should be aware of them.