17 Tidying data

This section is concerned with common problems in data preparation, namely use cases commonly found in raw datasets that need to be addressed to turn messy data into tidy data. These would be operations that you would perform on data obtained as a csv file from a collaborator or data repository, or as the result of scraping data from webpages or other sources. We derive many of our ideas from the paper Tidy Data by Hadley Wickham. Associated with that paper we will use two very powerful R libraries tidyr and dplyr which are extremely useful in writing scripts for data cleaning, preparation and summarization. A basic design principle behind these libraries is trying to effectively and efficiently capture very common use cases and operations performed in data cleaning. The paper frames these use cases and operations which are them implemented in software.

17.1 Tidy Data

Here we assume we are working with a data model based on rectangular data structures where

  1. Each attribute (or variable) forms a column
  2. Each entity (or observation) forms a row
  3. Each type of entity (observational unit) forms a table

Here is an example of a tidy dataset:

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time
##   <int> <int> <int>    <int>          <int>
## 1  2013     1     1      517            515
## 2  2013     1     1      533            529
## 3  2013     1     1      542            540
## 4  2013     1     1      544            545
## 5  2013     1     1      554            600
## 6  2013     1     1      554            558
## # … with 14 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

it has one observation per row, a single variable per column. Notice only information about flights are included here (e.g., no airport information other than the name) in these observations.

17.2 Common problems in messy data

The set of common operations we will study are based on these common problems found in datasets. We will see each one in detail:

  • Column headers are values, not variable names (gather)
  • Multiple variables stored in one column (split)
  • Variables stored in both rows and column (rotate)
  • Multiple types of observational units are stored in the same table (normalize)
  • Single observational unit stored in multiple tables (join)

We are using data from Hadley’s paper found in github. It’s included directory data:

17.2.1 Headers as values

The first problem we’ll see is the case where a table header contains values.

## Parsed with column specification:
## cols(
##   religion = col_character(),
##   `<$10k` = col_double(),
##   `$10-20k` = col_double(),
##   `$20-30k` = col_double(),
##   `$30-40k` = col_double(),
##   `$40-50k` = col_double(),
##   `$50-75k` = col_double(),
##   `$75-100k` = col_double(),
##   `$100-150k` = col_double(),
##   `>150k` = col_double(),
##   `Don't know/refused` = col_double()
## )
## # A tibble: 18 x 11
##    religion `<$10k` `$10-20k` `$20-30k` `$30-40k`
##    <chr>      <dbl>     <dbl>     <dbl>     <dbl>
##  1 Agnostic      27        34        60        81
##  2 Atheist       12        27        37        52
##  3 Buddhist      27        21        30        34
##  4 Catholic     418       617       732       670
##  5 Don’t k…      15        14        15        11
##  6 Evangel…     575       869      1064       982
##  7 Hindu          1         9         7         9
##  8 Histori…     228       244       236       238
##  9 Jehovah…      20        27        24        24
## 10 Jewish        19        19        25        25
## 11 Mainlin…     289       495       619       655
## 12 Mormon        29        40        48        51
## 13 Muslim         6         7         9        10
## 14 Orthodox      13        17        23        32
## 15 Other C…       9         7        11        13
## 16 Other F…      20        33        40        46
## 17 Other W…       5         2         3         4
## 18 Unaffil…     217       299       374       365
## # … with 6 more variables: `$40-50k` <dbl>,
## #   `$50-75k` <dbl>, `$75-100k` <dbl>,
## #   `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## #   know/refused` <dbl>

This table has the number of survey respondents of a specific religion that report their income within some range. A tidy version of this table would consider the variables of each observation to be religion, income, frequency where frequency has the number of respondents for each religion and income range. The function to use in the tidyr package is gather:

## # A tibble: 180 x 3
##    religion                income frequency
##    <chr>                   <chr>      <dbl>
##  1 Agnostic                <$10k         27
##  2 Atheist                 <$10k         12
##  3 Buddhist                <$10k         27
##  4 Catholic                <$10k        418
##  5 Don’t know/refused      <$10k         15
##  6 Evangelical Prot        <$10k        575
##  7 Hindu                   <$10k          1
##  8 Historically Black Prot <$10k        228
##  9 Jehovah's Witness       <$10k         20
## 10 Jewish                  <$10k         19
## # … with 170 more rows

This says: gather all the columns from the pew (except religion) into key-value columns income and frequency. This table is much easier to use in other analyses.

Another example: this table has a row for each song appearing in the billboard top 100. It contains track information, and the date it entered the top 100. It then shows the rank in each of the next 76 weeks.

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   artist = col_character(),
##   track = col_character(),
##   time = col_time(format = ""),
##   date.entered = col_date(format = ""),
##   wk66 = col_logical(),
##   wk67 = col_logical(),
##   wk68 = col_logical(),
##   wk69 = col_logical(),
##   wk70 = col_logical(),
##   wk71 = col_logical(),
##   wk72 = col_logical(),
##   wk73 = col_logical(),
##   wk74 = col_logical(),
##   wk75 = col_logical(),
##   wk76 = col_logical()
## )
## See spec(...) for full column specifications.
## # A tibble: 317 x 81
##     year artist track time  date.entered   wk1
##    <dbl> <chr>  <chr> <tim> <date>       <dbl>
##  1  2000 2 Pac  Baby… 04:22 2000-02-26      87
##  2  2000 2Ge+h… The … 03:15 2000-09-02      91
##  3  2000 3 Doo… Kryp… 03:53 2000-04-08      81
##  4  2000 3 Doo… Loser 04:24 2000-10-21      76
##  5  2000 504 B… Wobb… 03:35 2000-04-15      57
##  6  2000 98^0   Give… 03:24 2000-08-19      51
##  7  2000 A*Tee… Danc… 03:44 2000-07-08      97
##  8  2000 Aaliy… I Do… 04:15 2000-01-29      84
##  9  2000 Aaliy… Try … 04:03 2000-03-18      59
## 10  2000 Adams… Open… 05:30 2000-08-26      76
## # … with 307 more rows, and 75 more variables:
## #   wk2 <dbl>, wk3 <dbl>, wk4 <dbl>, wk5 <dbl>,
## #   wk6 <dbl>, wk7 <dbl>, wk8 <dbl>, wk9 <dbl>,
## #   wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>,
## #   wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>,
## #   wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>,
## #   wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>,
## #   wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>,
## #   wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>,
## #   wk46 <dbl>, wk47 <dbl>, wk48 <dbl>,
## #   wk49 <dbl>, wk50 <dbl>, wk51 <dbl>,
## #   wk52 <dbl>, wk53 <dbl>, wk54 <dbl>,
## #   wk55 <dbl>, wk56 <dbl>, wk57 <dbl>,
## #   wk58 <dbl>, wk59 <dbl>, wk60 <dbl>,
## #   wk61 <dbl>, wk62 <dbl>, wk63 <dbl>,
## #   wk64 <dbl>, wk65 <dbl>, wk66 <lgl>,
## #   wk67 <lgl>, wk68 <lgl>, wk69 <lgl>,
## #   wk70 <lgl>, wk71 <lgl>, wk72 <lgl>,
## #   wk73 <lgl>, wk74 <lgl>, wk75 <lgl>,
## #   wk76 <lgl>

Challenge: This dataset has values as column names. Which column names are values? How do we tidy this dataset?

17.2.2 Multiple variables in one column

The next problem we’ll see is the case when we see multiple variables in a single column. Consider the following dataset of tuberculosis cases:

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   iso2 = col_character()
## )
## See spec(...) for full column specifications.
## # A tibble: 5,769 x 22
##    iso2   year   m04  m514  m014 m1524 m2534
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 AD     1989    NA    NA    NA    NA    NA
##  2 AD     1990    NA    NA    NA    NA    NA
##  3 AD     1991    NA    NA    NA    NA    NA
##  4 AD     1992    NA    NA    NA    NA    NA
##  5 AD     1993    NA    NA    NA    NA    NA
##  6 AD     1994    NA    NA    NA    NA    NA
##  7 AD     1996    NA    NA     0     0     0
##  8 AD     1997    NA    NA     0     0     1
##  9 AD     1998    NA    NA     0     0     0
## 10 AD     1999    NA    NA     0     0     0
## # … with 5,759 more rows, and 15 more variables:
## #   m3544 <dbl>, m4554 <dbl>, m5564 <dbl>,
## #   m65 <dbl>, mu <dbl>, f04 <dbl>, f514 <dbl>,
## #   f014 <dbl>, f1524 <dbl>, f2534 <dbl>,
## #   f3544 <dbl>, f4554 <dbl>, f5564 <dbl>,
## #   f65 <dbl>, fu <dbl>

This table has a row for each year and strain of tuberculosis (given by the first two columns). The remaining columns state the number of cases for a given demographic. For example, m1524 corresponds to males between 15 and 24 years old, and f1524 are females age 15-24. As you can see each of these columns has two variables: sex and age.

Challenge: what else is untidy about this dataset?

So, we have to do two operations to tidy this table, first we need to use gather the tabulation columns into a demo and n columns (for demographic and number of cases):

## # A tibble: 115,380 x 4
##    iso2   year demo      n
##    <chr> <dbl> <chr> <dbl>
##  1 AD     1989 m04      NA
##  2 AD     1990 m04      NA
##  3 AD     1991 m04      NA
##  4 AD     1992 m04      NA
##  5 AD     1993 m04      NA
##  6 AD     1994 m04      NA
##  7 AD     1996 m04      NA
##  8 AD     1997 m04      NA
##  9 AD     1998 m04      NA
## 10 AD     1999 m04      NA
## # … with 115,370 more rows

Next, we need to separate the values in the demo column into two variables sex and age

## # A tibble: 115,380 x 5
##    iso2   year sex   age       n
##    <chr> <dbl> <chr> <chr> <dbl>
##  1 AD     1989 m     04       NA
##  2 AD     1990 m     04       NA
##  3 AD     1991 m     04       NA
##  4 AD     1992 m     04       NA
##  5 AD     1993 m     04       NA
##  6 AD     1994 m     04       NA
##  7 AD     1996 m     04       NA
##  8 AD     1997 m     04       NA
##  9 AD     1998 m     04       NA
## 10 AD     1999 m     04       NA
## # … with 115,370 more rows

This calls the separate function on table tidy_db, separating the demo variable into variables sex and age by separating each value after the first character (that’s the sep argument).

We can put these two commands together in a pipeline:

## # A tibble: 115,380 x 5
##    iso2   year sex   age       n
##    <chr> <dbl> <chr> <chr> <dbl>
##  1 AD     1989 m     04       NA
##  2 AD     1990 m     04       NA
##  3 AD     1991 m     04       NA
##  4 AD     1992 m     04       NA
##  5 AD     1993 m     04       NA
##  6 AD     1994 m     04       NA
##  7 AD     1996 m     04       NA
##  8 AD     1997 m     04       NA
##  9 AD     1998 m     04       NA
## 10 AD     1999 m     04       NA
## # … with 115,370 more rows

17.2.3 Variables stored in both rows and columns

This is the messiest, commonly found type of data. Let’s take a look at an example, this is daily weather data from for one weather station in Mexico in 2010.

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   id = col_character(),
##   element = col_character(),
##   d9 = col_logical(),
##   d12 = col_logical(),
##   d18 = col_logical(),
##   d19 = col_logical(),
##   d20 = col_logical(),
##   d21 = col_logical(),
##   d22 = col_logical(),
##   d24 = col_logical()
## )
## See spec(...) for full column specifications.
## # A tibble: 22 x 35
##    id       year month element    d1    d2    d3
##    <chr>   <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>
##  1 MX17004  2010     1 tmax       NA  NA    NA  
##  2 MX17004  2010     1 tmin       NA  NA    NA  
##  3 MX17004  2010     2 tmax       NA  27.3  24.1
##  4 MX17004  2010     2 tmin       NA  14.4  14.4
##  5 MX17004  2010     3 tmax       NA  NA    NA  
##  6 MX17004  2010     3 tmin       NA  NA    NA  
##  7 MX17004  2010     4 tmax       NA  NA    NA  
##  8 MX17004  2010     4 tmin       NA  NA    NA  
##  9 MX17004  2010     5 tmax       NA  NA    NA  
## 10 MX17004  2010     5 tmin       NA  NA    NA  
## # … with 12 more rows, and 28 more variables:
## #   d4 <dbl>, d5 <dbl>, d6 <dbl>, d7 <dbl>,
## #   d8 <dbl>, d9 <lgl>, d10 <dbl>, d11 <dbl>,
## #   d12 <lgl>, d13 <dbl>, d14 <dbl>, d15 <dbl>,
## #   d16 <dbl>, d17 <dbl>, d18 <lgl>, d19 <lgl>,
## #   d20 <lgl>, d21 <lgl>, d22 <lgl>, d23 <dbl>,
## #   d24 <lgl>, d25 <dbl>, d26 <dbl>, d27 <dbl>,
## #   d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>

So, we have two rows for each month, one with maximum daily temperature, one with minimum daily temperature, the columns starting with d correspond to the day in the where the measurements were made.

Challenge: How would a tidy version of this data look like?

## # A tibble: 33 x 6
##    id       year month day    tmax  tmin
##    <chr>   <dbl> <dbl> <chr> <dbl> <dbl>
##  1 MX17004  2010     1 d30    27.8  14.5
##  2 MX17004  2010     2 d11    29.7  13.4
##  3 MX17004  2010     2 d2     27.3  14.4
##  4 MX17004  2010     2 d23    29.9  10.7
##  5 MX17004  2010     2 d3     24.1  14.4
##  6 MX17004  2010     3 d10    34.5  16.8
##  7 MX17004  2010     3 d16    31.1  17.6
##  8 MX17004  2010     3 d5     32.1  14.2
##  9 MX17004  2010     4 d27    36.3  16.7
## 10 MX17004  2010     5 d27    33.2  18.2
## # … with 23 more rows

The new function we’ve used here is spread. It does the inverse of gather it spreads columns element and value into separate columns.

17.2.4 Multiple types in one table

Remember that an important aspect of tidy data is that it contains exactly one kind of observation in a single table. Let’s see the billboard example again after the gather operation we did before:

## # A tibble: 5,307 x 7
##     year artist track time  date.entered week 
##    <dbl> <chr>  <chr> <tim> <date>       <chr>
##  1  2000 2 Pac  Baby… 04:22 2000-02-26   wk1  
##  2  2000 2Ge+h… The … 03:15 2000-09-02   wk1  
##  3  2000 3 Doo… Kryp… 03:53 2000-04-08   wk1  
##  4  2000 3 Doo… Loser 04:24 2000-10-21   wk1  
##  5  2000 504 B… Wobb… 03:35 2000-04-15   wk1  
##  6  2000 98^0   Give… 03:24 2000-08-19   wk1  
##  7  2000 A*Tee… Danc… 03:44 2000-07-08   wk1  
##  8  2000 Aaliy… I Do… 04:15 2000-01-29   wk1  
##  9  2000 Aaliy… Try … 04:03 2000-03-18   wk1  
## 10  2000 Adams… Open… 05:30 2000-08-26   wk1  
## # … with 5,297 more rows, and 1 more variable:
## #   rank <dbl>

Let’s sort this table by track to see a problem with this table:

## # A tibble: 5,307 x 7
##     year artist track time  date.entered week 
##    <dbl> <chr>  <chr> <tim> <date>       <chr>
##  1  2000 Nelly  (Hot… 04:17 2000-04-29   wk1  
##  2  2000 Nelly  (Hot… 04:17 2000-04-29   wk2  
##  3  2000 Nelly  (Hot… 04:17 2000-04-29   wk3  
##  4  2000 Nelly  (Hot… 04:17 2000-04-29   wk4  
##  5  2000 Nelly  (Hot… 04:17 2000-04-29   wk5  
##  6  2000 Nelly  (Hot… 04:17 2000-04-29   wk6  
##  7  2000 Nelly  (Hot… 04:17 2000-04-29   wk7  
##  8  2000 Nelly  (Hot… 04:17 2000-04-29   wk8  
##  9  2000 Nelly  (Hot… 04:17 2000-04-29   wk9  
## 10  2000 Nelly  (Hot… 04:17 2000-04-29   wk10 
## # … with 5,297 more rows, and 1 more variable:
## #   rank <dbl>

We have a lot of repeated information in many of these rows (the artist, track name, year, title and date entered). The problem is that this table contains information about both tracks and rank in billboard. That’s two different kinds of observations that should belong in two different tables in a tidy dataset.

Let’s make a song table that only includes information about songs:

## # A tibble: 317 x 5
##    artist    track        year time  date.entered
##    <chr>     <chr>       <dbl> <tim> <date>      
##  1 Nelly     (Hot S**t)…  2000 04:17 2000-04-29  
##  2 Nu Flavor 3 Little W…  2000 03:54 2000-06-03  
##  3 Jean, Wy… 911          2000 04:00 2000-10-07  
##  4 Brock, C… A Country …  2000 03:54 2000-01-01  
##  5 Clark, T… A Little G…  2000 03:07 2000-12-16  
##  6 Son By F… A Puro Dol…  2000 03:30 2000-04-08  
##  7 Carter, … Aaron's Pa…  2000 03:23 2000-08-26  
##  8 Nine Days Absolutely…  2000 03:09 2000-05-06  
##  9 De La So… All Good?    2000 05:02 2000-12-23  
## 10 Blink-182 All The Sm…  2000 02:52 1999-12-04  
## # … with 307 more rows

The unique function removes any duplicate rows in a table. That’s how we have a single row for each song.

Next, we would like to remove all the song information from the rank table. But we need to do it in a way that still remembers which song each ranking observation corresponds to. To do that, let’s first give each song an identifier that we can use to link songs and rankings. So, we can produce the final version of our song table like this:

## # A tibble: 317 x 6
##    artist  track  year time  date.entered song_id
##    <chr>   <chr> <dbl> <tim> <date>         <int>
##  1 Nelly   (Hot…  2000 04:17 2000-04-29         1
##  2 Nu Fla… 3 Li…  2000 03:54 2000-06-03         2
##  3 Jean, … 911    2000 04:00 2000-10-07         3
##  4 Brock,… A Co…  2000 03:54 2000-01-01         4
##  5 Clark,… A Li…  2000 03:07 2000-12-16         5
##  6 Son By… A Pu…  2000 03:30 2000-04-08         6
##  7 Carter… Aaro…  2000 03:23 2000-08-26         7
##  8 Nine D… Abso…  2000 03:09 2000-05-06         8
##  9 De La … All …  2000 05:02 2000-12-23         9
## 10 Blink-… All …  2000 02:52 1999-12-04        10
## # … with 307 more rows

The mutate function adds a new column to the table, in this case with column name song_id and value the row number the song appears in the table (from the row_number column).

Now we can make a rank table, we combine the tidy billboard table with our new song table using a join (we’ll learn all about joins later). It checks the values on each row of the billboard table and looks for rows in the song table that have the exact same values, and makes a new row that combines the information from both tables.

## # A tibble: 5,307 x 8
##     year artist track time  date.entered week 
##    <dbl> <chr>  <chr> <tim> <date>       <chr>
##  1  2000 Nelly  (Hot… 04:17 2000-04-29   wk1  
##  2  2000 Nelly  (Hot… 04:17 2000-04-29   wk2  
##  3  2000 Nelly  (Hot… 04:17 2000-04-29   wk3  
##  4  2000 Nelly  (Hot… 04:17 2000-04-29   wk4  
##  5  2000 Nelly  (Hot… 04:17 2000-04-29   wk5  
##  6  2000 Nelly  (Hot… 04:17 2000-04-29   wk6  
##  7  2000 Nelly  (Hot… 04:17 2000-04-29   wk7  
##  8  2000 Nelly  (Hot… 04:17 2000-04-29   wk8  
##  9  2000 Nelly  (Hot… 04:17 2000-04-29   wk9  
## 10  2000 Nelly  (Hot… 04:17 2000-04-29   wk10 
## # … with 5,297 more rows, and 2 more variables:
## #   rank <dbl>, song_id <int>

That adds the song_id variable to the tidy_billboard table. So now we can remove the song information and only keep ranking information and the song_id.

## # A tibble: 5,307 x 3
##    song_id week   rank
##      <int> <chr> <dbl>
##  1       1 wk1     100
##  2       1 wk2      99
##  3       1 wk3      96
##  4       1 wk4      76
##  5       1 wk5      55
##  6       1 wk6      37
##  7       1 wk7      24
##  8       1 wk8      24
##  9       1 wk9      30
## 10       1 wk10     36
## # … with 5,297 more rows

Challenge: Let’s do a little better job at tidying the billboard dataset:

  1. When using gather to make the week and rank columns, remove any weeks where the song does not appear in the top 100. This is coded as missing (NA). See the na.rm argument to gather.
  2. Make week a numeric variable (i.e., remove wk). See what the extract_numeric function does.
  3. Instead of date.entered add a date column that states the actual date of each ranking. See how R deals with dates ?Date and how you can turn a string into a Date using as.Date.
  4. Sort the resulting table by date and rank.
  5. Make new song and rank tables. song will now not have the date.entered column, and rank will have the new date column you have just created.