23 EDA: Handling Missing Data
We can now move on to a very important aspect of data preparation and transformation: how to deal with missing data. By missing data we mean values that are unrecorded, unknown, or unspecified in a dataset. We saw an example of this when we looked at the unit on tidying data. Here is the tidy weather dataset again:
## # A tibble: 22 x 35
## id year month element d1 d2 d3
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 MX17004 2010 1 tmax NA NA NA
## 2 MX17004 2010 1 tmin NA NA NA
## 3 MX17004 2010 2 tmax NA 27.3 24.1
## 4 MX17004 2010 2 tmin NA 14.4 14.4
## 5 MX17004 2010 3 tmax NA NA NA
## 6 MX17004 2010 3 tmin NA NA NA
## 7 MX17004 2010 4 tmax NA NA NA
## 8 MX17004 2010 4 tmin NA NA NA
## 9 MX17004 2010 5 tmax NA NA NA
## 10 MX17004 2010 5 tmin NA NA NA
## # … with 12 more rows, and 28 more variables:
## # d4 <dbl>, d5 <dbl>, d6 <dbl>, d7 <dbl>,
## # d8 <dbl>, d9 <lgl>, d10 <dbl>, d11 <dbl>,
## # d12 <lgl>, d13 <dbl>, d14 <dbl>, d15 <dbl>,
## # d16 <dbl>, d17 <dbl>, d18 <lgl>, d19 <lgl>,
## # d20 <lgl>, d21 <lgl>, d22 <lgl>, d23 <dbl>,
## # d24 <lgl>, d25 <dbl>, d26 <dbl>, d27 <dbl>,
## # d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>
And the result of tidying this dataset:
## # A tibble: 341 x 6
## id year month day tmax tmin
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 MX17004 2010 1 d1 NA NA
## 2 MX17004 2010 1 d10 NA NA
## 3 MX17004 2010 1 d11 NA NA
## 4 MX17004 2010 1 d12 NA NA
## 5 MX17004 2010 1 d13 NA NA
## 6 MX17004 2010 1 d14 NA NA
## 7 MX17004 2010 1 d15 NA NA
## 8 MX17004 2010 1 d16 NA NA
## 9 MX17004 2010 1 d17 NA NA
## 10 MX17004 2010 1 d18 NA NA
## # … with 331 more rows
In this dataset, temperature observations coded as NA are considered missing. We can imagine a few reasons why measurements would be missing: (a) the measurement failed on a specific day for a specific weather station, (b) certain stations only measure temperatures on specific days of the month, or (c) the measurement fails when the temperature is too high or too low. Knowing which of these applies can change how we approach the missing data. As you can see, how to treat missing data depends heavily on how the data was obtained, and the more you know about a dataset, the better the decisions you can make.
In general, the central question with missing data is: should we remove observations with missing values, or should we impute the missing values? Indeed, can we do anything at all with a dataset that is missing data? Answering these questions requires us to think about why the data is missing.
23.1 Mechanisms of missing data
For this discussion, let’s assume we have an attribute \(y\) that contains missing data, a binary attribute \(r\) that encodes whether the corresponding observation in \(y\) is missing, and other attributes \(x\) in our dataset.
We will also make statements like “depends” or “does not depend”, e.g., the value of \(r_i\) does not depend on the value of \(y_i\). Until we formalize this concept further, you can take this to mean that the properties of the distribution of \(r\) do not change based on the values of \(y\).
Data missing completely at random (MCAR): missingness does not depend on any values of the missing or measured data. That is, missingness \(r_i\) depends neither on the (unobserved) value \(y_i\) nor on the observed values \(x_i\). In this case, entities with missing data can safely be removed from the analysis. Imputation can also be performed; see more below. In our weather example, this would be case (a): stations failed for no discernible reason.
Data missing at random (MAR): missingness \(r_i\) does not depend on the value of \(y_i\), but may depend on the value of \(x_i\). Here, removing data can bias the analysis, since you would drop values of \(x\) based on missingness and potentially change the distribution of \(x\). Imputation can be done as well; see more below. In our weather example, this would be case (b): measurements are not taken on specific days of the month (where “day of the month” serves the role of \(x\)).
Data not missing at random (NMAR): missingness \(r_i\) depends on \(y_i\). This is the most pernicious case, and usually means that we want to go back to our collaborator and tell them that we are in a bind. Removing or imputing data as discussed below is not appropriate in this case, and appropriate methods for dealing with it are beyond the scope of this discussion. This is a good resource: https://www.wiley.com/en-us/Statistical+Data+Cleaning+with+Applications+in+R-p-9781118897157 (Ch. 10). In our weather example, this would be case (c): measurements fail when the temperature is too hot or too cold.
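To make the MAR case concrete, here is a small simulation sketch (all names and numbers are illustrative): \(y\) depends on a fully observed attribute \(x\), and missingness in \(y\) depends only on \(x\). Dropping incomplete rows then shifts the distribution of \(x\):

```r
set.seed(1)
n <- 10000
x <- rnorm(n)            # fully observed attribute
y <- x + rnorm(n)        # attribute that will have missing values

# MAR: probability that y is missing depends only on x (larger x, more missing)
r <- runif(n) < plogis(2 * x)
y_obs <- ifelse(r, NA, y)

mean(x)                  # close to 0 by construction
mean(x[!is.na(y_obs)])   # noticeably below 0: complete cases skew toward small x
```

The complete-case mean of \(x\) is biased even though missingness never looked at \(y\) itself, which is exactly why dropping rows under MAR is dangerous.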
So, in general, the first step when dealing with missing data is to understand why and how the data may be missing; that is, talk to your collaborator or the person who created the dataset.
23.2 Handling missing data
23.2.1 Removing missing data
Once you know that data is MCAR and only a relatively small fraction of observations have missing values, it may be safe to remove those observations.
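One way to do this removal is with `tidyr::drop_na`. Here is a minimal, self-contained sketch; the small tibble is a toy stand-in for the tidy weather data (the output below shows the result on the actual dataset):

```r
library(dplyr)
library(tidyr)

# toy stand-in for the tidy weather data
tidy_weather <- tibble(
  id    = "MX17004", year = 2010,
  month = c(1, 2, 2),
  day   = c("d1", "d2", "d3"),
  tmax  = c(NA, 27.3, 24.1),
  tmin  = c(NA, 14.4, 14.4)
)

# drop every row with a missing tmax or tmin
complete_weather <- tidy_weather %>% drop_na(tmax, tmin)
```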
## # A tibble: 33 x 6
## id year month day tmax tmin
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 MX17004 2010 1 d30 27.8 14.5
## 2 MX17004 2010 2 d11 29.7 13.4
## 3 MX17004 2010 2 d2 27.3 14.4
## 4 MX17004 2010 2 d23 29.9 10.7
## 5 MX17004 2010 2 d3 24.1 14.4
## 6 MX17004 2010 3 d10 34.5 16.8
## 7 MX17004 2010 3 d16 31.1 17.6
## 8 MX17004 2010 3 d5 32.1 14.2
## 9 MX17004 2010 4 d27 36.3 16.7
## 10 MX17004 2010 5 d27 33.2 18.2
## # … with 23 more rows
23.2.2 Encoding as missing
In the MCAR or MAR case for categorical attributes \(y\), a useful approach is to encode the fact that a value is missing as a new category and include that in subsequent analysis of attribute \(y\).
tb <- read_csv(file.path("data", "tb.csv"))
tidy_tb <- tb %>%
  gather(demo, n, -iso2, -year) %>%          # gather demographic columns into key-value pairs
  separate(demo, c("sex", "age"), sep = 1)   # split, e.g., "m04" into sex "m" and age "04"
tidy_tb %>%
  tidyr::replace_na(list(iso2 = "missing"))  # treat a missing country code as its own category
## # A tibble: 115,380 x 5
## iso2 year sex age n
## <chr> <dbl> <chr> <chr> <dbl>
## 1 AD 1989 m 04 NA
## 2 AD 1990 m 04 NA
## 3 AD 1991 m 04 NA
## 4 AD 1992 m 04 NA
## 5 AD 1993 m 04 NA
## 6 AD 1994 m 04 NA
## 7 AD 1996 m 04 NA
## 8 AD 1997 m 04 NA
## 9 AD 1998 m 04 NA
## 10 AD 1999 m 04 NA
## # … with 115,370 more rows
23.2.3 Imputation
23.2.3.1 MCAR (also for MAR, but this is not ideal)
In this case we can use a simple imputation method for \(y\): for numeric attributes, we replace missing values in \(y\) with the mean of the non-missing values of \(y\).
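As a sketch, mean imputation for a numeric column can be written with `mutate` and `ifelse`; the tibble here is a toy example (for the flights data shown below, the same pattern would apply to a column such as `dep_delay`):

```r
library(dplyr)

df <- tibble(y = c(1, 2, NA, 4, NA))

# replace missing values of y with the mean of the non-missing values
df_imputed <- df %>%
  mutate(y = ifelse(is.na(y), mean(y, na.rm = TRUE), y))
```

Note that the mean of the column is unchanged by this operation, a point we return to at the end of the chapter.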
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## 7 2013 1 1 555 600
## 8 2013 1 1 557 600
## 9 2013 1 1 557 600
## 10 2013 1 1 558 600
## # … with 336,766 more rows, and 14 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
For categorical attributes \(y\), we replace missing values with the most common category among the non-missing values of \(y\).
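A sketch of this mode imputation for a categorical column (toy data; all names are illustrative):

```r
library(dplyr)

df <- tibble(color = c("red", "blue", "red", NA, "red", NA))

# find the most common non-missing category
mode_val <- df %>%
  filter(!is.na(color)) %>%
  count(color, sort = TRUE) %>%
  slice(1) %>%
  pull(color)

# replace missing values with that category
df_imputed <- df %>%
  mutate(color = coalesce(color, mode_val))
```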
23.2.3.2 MAR
In this case we use a more complex method: we replace missing values of attribute \(y\) with values predicted from other attributes \(x\) when the attributes are related (we will see linear regression, using the lm and predict functions, later on).
# fit a linear model predicting departure delay from origin airport
dep_delay_fit <- flights %>% lm(dep_delay~origin, data=.)

# use average delay conditioned on origin airport to fill in missing delays
flights %>%
  modelr::add_predictions(dep_delay_fit, var="pred_delay") %>%
  mutate(dep_delay_fixed =
           ifelse(!is.na(dep_delay), dep_delay, pred_delay)) %>%
  select(origin, dest, dep_delay, dep_delay_fixed) %>%
  filter(is.na(dep_delay))
## # A tibble: 8,255 x 4
## origin dest dep_delay dep_delay_fixed
## <chr> <chr> <dbl> <dbl>
## 1 EWR RDU NA 15.1
## 2 LGA DFW NA 10.3
## 3 LGA MIA NA 10.3
## 4 JFK FLL NA 12.1
## 5 EWR CVG NA 15.1
## 6 EWR PIT NA 15.1
## 7 EWR MHT NA 15.1
## 8 EWR ATL NA 15.1
## 9 EWR IND NA 15.1
## 10 JFK LAX NA 12.1
## # … with 8,245 more rows
For categorical attributes we use a different kind of regression, more appropriate for categorical outcomes (logistic regression; again, we will see this later on).
For both imputation methods, a common approach is to add an indicator variable recording whether each value was imputed:
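The output below includes such an indicator (dep_delay_missing). The pattern is to record `is.na(y)` before overwriting `y`; a self-contained sketch on toy data:

```r
library(dplyr)

df <- tibble(y = c(1, 2, NA, 4))

df_flagged <- df %>%
  mutate(
    y_missing = is.na(y),  # record missingness before imputing
    y = ifelse(is.na(y), mean(y, na.rm = TRUE), y)
  )
```

Within `mutate`, expressions are evaluated in order, so `y_missing` is computed from the original `y` before the imputation overwrites it.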
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## 7 2013 1 1 555 600
## 8 2013 1 1 557 600
## 9 2013 1 1 557 600
## 10 2013 1 1 558 600
## # … with 336,766 more rows, and 15 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, dep_delay_missing <lgl>
23.3 Implications of imputation
Imputation has some effects that can impact analysis.
The central tendency of data is retained. For example, if we impute missing data using the mean of a numeric variable, the mean after imputation will not change. This is a good reason to impute based on estimates of central tendency.
The spread of the data will change. After mean imputation, the spread of the data will be smaller than the spread computed from the non-missing values alone. This can be problematic, as underestimating the spread of the data can yield over-confident inferences in downstream analysis.
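Both effects are easy to verify directly; a base-R sketch with a toy vector:

```r
y <- c(10, 12, NA, 14, NA, 16)

# mean imputation
y_imputed <- ifelse(is.na(y), mean(y, na.rm = TRUE), y)

mean(y_imputed) == mean(y, na.rm = TRUE)  # TRUE: central tendency preserved
sd(y_imputed) < sd(y, na.rm = TRUE)       # TRUE: spread is underestimated
```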
We will not address these issues directly in later chapters, but you should be aware of them.