17 Tidying data
This section is concerned with common problems in data preparation, namely use cases commonly found in raw datasets that need to be addressed to turn messy data into tidy data. These would be operations that you would perform on data obtained as a csv file from a collaborator or data repository, or as the result of scraping data from webpages or other sources. We derive many of our ideas from the paper Tidy Data by Hadley Wickham. Associated with that paper we will use two very powerful R libraries tidyr
and dplyr
which are extremely useful in writing scripts for data cleaning, preparation and summarization. A basic design principle behind these libraries is trying to effectively and efficiently capture very common use cases and operations performed in data cleaning. The paper frames these use cases and operations which are them implemented in software.
17.1 Tidy Data
Here we assume we are working with a data model based on rectangular data structures where
- Each attribute (or variable) forms a column
- Each entity (or observation) forms a row
- Each type of entity (observational unit) forms a table
Here is an example of a tidy dataset:
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## # … with 14 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
it has one observation per row, a single variable per column. Notice only information about flights are included here (e.g., no airport information other than the name) in these observations.
17.2 Common problems in messy data
The set of common operations we will study are based on these common problems found in datasets. We will see each one in detail:
- Column headers are values, not variable names (gather)
- Multiple variables stored in one column (split)
- Variables stored in both rows and column (rotate)
- Multiple types of observational units are stored in the same table (normalize)
- Single observational unit stored in multiple tables (join)
We are using data from Hadley’s paper found in github. It’s included directory data
:
17.2.1 Headers as values
The first problem we’ll see is the case where a table header contains values.
## Parsed with column specification:
## cols(
## religion = col_character(),
## `<$10k` = col_double(),
## `$10-20k` = col_double(),
## `$20-30k` = col_double(),
## `$30-40k` = col_double(),
## `$40-50k` = col_double(),
## `$50-75k` = col_double(),
## `$75-100k` = col_double(),
## `$100-150k` = col_double(),
## `>150k` = col_double(),
## `Don't know/refused` = col_double()
## )
## # A tibble: 18 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81
## 2 Atheist 12 27 37 52
## 3 Buddhist 27 21 30 34
## 4 Catholic 418 617 732 670
## 5 Don’t k… 15 14 15 11
## 6 Evangel… 575 869 1064 982
## 7 Hindu 1 9 7 9
## 8 Histori… 228 244 236 238
## 9 Jehovah… 20 27 24 24
## 10 Jewish 19 19 25 25
## 11 Mainlin… 289 495 619 655
## 12 Mormon 29 40 48 51
## 13 Muslim 6 7 9 10
## 14 Orthodox 13 17 23 32
## 15 Other C… 9 7 11 13
## 16 Other F… 20 33 40 46
## 17 Other W… 5 2 3 4
## 18 Unaffil… 217 299 374 365
## # … with 6 more variables: `$40-50k` <dbl>,
## # `$50-75k` <dbl>, `$75-100k` <dbl>,
## # `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## # know/refused` <dbl>
This table has the number of survey respondents of a specific religion that report their income within some range. A tidy version of this table would consider the variables of each observation to be religion, income, frequency
where frequency
has the number of respondents for each religion and income range. The function to use in the tidyr
package is gather
:
## # A tibble: 180 x 3
## religion income frequency
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don’t know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
## 7 Hindu <$10k 1
## 8 Historically Black Prot <$10k 228
## 9 Jehovah's Witness <$10k 20
## 10 Jewish <$10k 19
## # … with 170 more rows
This says: gather all the columns from the pew
(except religion
) into key-value columns income
and frequency
. This table is much easier to use in other analyses.
Another example: this table has a row for each song appearing in the billboard top 100. It contains track information, and the date it entered the top 100. It then shows the rank in each of the next 76 weeks.
## Parsed with column specification:
## cols(
## .default = col_double(),
## artist = col_character(),
## track = col_character(),
## time = col_time(format = ""),
## date.entered = col_date(format = ""),
## wk66 = col_logical(),
## wk67 = col_logical(),
## wk68 = col_logical(),
## wk69 = col_logical(),
## wk70 = col_logical(),
## wk71 = col_logical(),
## wk72 = col_logical(),
## wk73 = col_logical(),
## wk74 = col_logical(),
## wk75 = col_logical(),
## wk76 = col_logical()
## )
## See spec(...) for full column specifications.
## # A tibble: 317 x 81
## year artist track time date.entered wk1
## <dbl> <chr> <chr> <tim> <date> <dbl>
## 1 2000 2 Pac Baby… 04:22 2000-02-26 87
## 2 2000 2Ge+h… The … 03:15 2000-09-02 91
## 3 2000 3 Doo… Kryp… 03:53 2000-04-08 81
## 4 2000 3 Doo… Loser 04:24 2000-10-21 76
## 5 2000 504 B… Wobb… 03:35 2000-04-15 57
## 6 2000 98^0 Give… 03:24 2000-08-19 51
## 7 2000 A*Tee… Danc… 03:44 2000-07-08 97
## 8 2000 Aaliy… I Do… 04:15 2000-01-29 84
## 9 2000 Aaliy… Try … 04:03 2000-03-18 59
## 10 2000 Adams… Open… 05:30 2000-08-26 76
## # … with 307 more rows, and 75 more variables:
## # wk2 <dbl>, wk3 <dbl>, wk4 <dbl>, wk5 <dbl>,
## # wk6 <dbl>, wk7 <dbl>, wk8 <dbl>, wk9 <dbl>,
## # wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>,
## # wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>,
## # wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>,
## # wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>,
## # wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>,
## # wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>,
## # wk46 <dbl>, wk47 <dbl>, wk48 <dbl>,
## # wk49 <dbl>, wk50 <dbl>, wk51 <dbl>,
## # wk52 <dbl>, wk53 <dbl>, wk54 <dbl>,
## # wk55 <dbl>, wk56 <dbl>, wk57 <dbl>,
## # wk58 <dbl>, wk59 <dbl>, wk60 <dbl>,
## # wk61 <dbl>, wk62 <dbl>, wk63 <dbl>,
## # wk64 <dbl>, wk65 <dbl>, wk66 <lgl>,
## # wk67 <lgl>, wk68 <lgl>, wk69 <lgl>,
## # wk70 <lgl>, wk71 <lgl>, wk72 <lgl>,
## # wk73 <lgl>, wk74 <lgl>, wk75 <lgl>,
## # wk76 <lgl>
Challenge: This dataset has values as column names. Which column names are values? How do we tidy this dataset?
17.2.2 Multiple variables in one column
The next problem we’ll see is the case when we see multiple variables in a single column. Consider the following dataset of tuberculosis cases:
## Parsed with column specification:
## cols(
## .default = col_double(),
## iso2 = col_character()
## )
## See spec(...) for full column specifications.
## # A tibble: 5,769 x 22
## iso2 year m04 m514 m014 m1524 m2534
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AD 1989 NA NA NA NA NA
## 2 AD 1990 NA NA NA NA NA
## 3 AD 1991 NA NA NA NA NA
## 4 AD 1992 NA NA NA NA NA
## 5 AD 1993 NA NA NA NA NA
## 6 AD 1994 NA NA NA NA NA
## 7 AD 1996 NA NA 0 0 0
## 8 AD 1997 NA NA 0 0 1
## 9 AD 1998 NA NA 0 0 0
## 10 AD 1999 NA NA 0 0 0
## # … with 5,759 more rows, and 15 more variables:
## # m3544 <dbl>, m4554 <dbl>, m5564 <dbl>,
## # m65 <dbl>, mu <dbl>, f04 <dbl>, f514 <dbl>,
## # f014 <dbl>, f1524 <dbl>, f2534 <dbl>,
## # f3544 <dbl>, f4554 <dbl>, f5564 <dbl>,
## # f65 <dbl>, fu <dbl>
This table has a row for each year and strain of tuberculosis (given by the first two columns). The remaining columns state the number of cases for a given demographic. For example, m1524
corresponds to males between 15 and 24 years old, and f1524
are females age 15-24. As you can see each of these columns has two variables: sex
and age
.
Challenge: what else is untidy about this dataset?
So, we have to do two operations to tidy this table, first we need to use gather
the tabulation columns into a demo
and n
columns (for demographic and number of cases):
## # A tibble: 115,380 x 4
## iso2 year demo n
## <chr> <dbl> <chr> <dbl>
## 1 AD 1989 m04 NA
## 2 AD 1990 m04 NA
## 3 AD 1991 m04 NA
## 4 AD 1992 m04 NA
## 5 AD 1993 m04 NA
## 6 AD 1994 m04 NA
## 7 AD 1996 m04 NA
## 8 AD 1997 m04 NA
## 9 AD 1998 m04 NA
## 10 AD 1999 m04 NA
## # … with 115,370 more rows
Next, we need to separate
the values in the demo
column into two variables sex
and age
## # A tibble: 115,380 x 5
## iso2 year sex age n
## <chr> <dbl> <chr> <chr> <dbl>
## 1 AD 1989 m 04 NA
## 2 AD 1990 m 04 NA
## 3 AD 1991 m 04 NA
## 4 AD 1992 m 04 NA
## 5 AD 1993 m 04 NA
## 6 AD 1994 m 04 NA
## 7 AD 1996 m 04 NA
## 8 AD 1997 m 04 NA
## 9 AD 1998 m 04 NA
## 10 AD 1999 m 04 NA
## # … with 115,370 more rows
This calls the separate
function on table tidy_db
, separating the demo
variable into variables sex
and age
by separating each value after the first character (that’s the sep
argument).
We can put these two commands together in a pipeline:
## # A tibble: 115,380 x 5
## iso2 year sex age n
## <chr> <dbl> <chr> <chr> <dbl>
## 1 AD 1989 m 04 NA
## 2 AD 1990 m 04 NA
## 3 AD 1991 m 04 NA
## 4 AD 1992 m 04 NA
## 5 AD 1993 m 04 NA
## 6 AD 1994 m 04 NA
## 7 AD 1996 m 04 NA
## 8 AD 1997 m 04 NA
## 9 AD 1998 m 04 NA
## 10 AD 1999 m 04 NA
## # … with 115,370 more rows
17.2.3 Variables stored in both rows and columns
This is the messiest, commonly found type of data. Let’s take a look at an example, this is daily weather data from for one weather station in Mexico in 2010.
## Parsed with column specification:
## cols(
## .default = col_double(),
## id = col_character(),
## element = col_character(),
## d9 = col_logical(),
## d12 = col_logical(),
## d18 = col_logical(),
## d19 = col_logical(),
## d20 = col_logical(),
## d21 = col_logical(),
## d22 = col_logical(),
## d24 = col_logical()
## )
## See spec(...) for full column specifications.
## # A tibble: 22 x 35
## id year month element d1 d2 d3
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 MX17004 2010 1 tmax NA NA NA
## 2 MX17004 2010 1 tmin NA NA NA
## 3 MX17004 2010 2 tmax NA 27.3 24.1
## 4 MX17004 2010 2 tmin NA 14.4 14.4
## 5 MX17004 2010 3 tmax NA NA NA
## 6 MX17004 2010 3 tmin NA NA NA
## 7 MX17004 2010 4 tmax NA NA NA
## 8 MX17004 2010 4 tmin NA NA NA
## 9 MX17004 2010 5 tmax NA NA NA
## 10 MX17004 2010 5 tmin NA NA NA
## # … with 12 more rows, and 28 more variables:
## # d4 <dbl>, d5 <dbl>, d6 <dbl>, d7 <dbl>,
## # d8 <dbl>, d9 <lgl>, d10 <dbl>, d11 <dbl>,
## # d12 <lgl>, d13 <dbl>, d14 <dbl>, d15 <dbl>,
## # d16 <dbl>, d17 <dbl>, d18 <lgl>, d19 <lgl>,
## # d20 <lgl>, d21 <lgl>, d22 <lgl>, d23 <dbl>,
## # d24 <lgl>, d25 <dbl>, d26 <dbl>, d27 <dbl>,
## # d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>
So, we have two rows for each month, one with maximum daily temperature, one with minimum daily temperature, the columns starting with d
correspond to the day in the where the measurements were made.
Challenge: How would a tidy version of this data look like?
## # A tibble: 33 x 6
## id year month day tmax tmin
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 MX17004 2010 1 d30 27.8 14.5
## 2 MX17004 2010 2 d11 29.7 13.4
## 3 MX17004 2010 2 d2 27.3 14.4
## 4 MX17004 2010 2 d23 29.9 10.7
## 5 MX17004 2010 2 d3 24.1 14.4
## 6 MX17004 2010 3 d10 34.5 16.8
## 7 MX17004 2010 3 d16 31.1 17.6
## 8 MX17004 2010 3 d5 32.1 14.2
## 9 MX17004 2010 4 d27 36.3 16.7
## 10 MX17004 2010 5 d27 33.2 18.2
## # … with 23 more rows
The new function we’ve used here is spread
. It does the inverse of gather
it spreads columns element
and value
into separate columns.
17.2.4 Multiple types in one table
Remember that an important aspect of tidy data is that it contains exactly one kind of observation in a single table. Let’s see the billboard example again after the gather
operation we did before:
## # A tibble: 5,307 x 7
## year artist track time date.entered week
## <dbl> <chr> <chr> <tim> <date> <chr>
## 1 2000 2 Pac Baby… 04:22 2000-02-26 wk1
## 2 2000 2Ge+h… The … 03:15 2000-09-02 wk1
## 3 2000 3 Doo… Kryp… 03:53 2000-04-08 wk1
## 4 2000 3 Doo… Loser 04:24 2000-10-21 wk1
## 5 2000 504 B… Wobb… 03:35 2000-04-15 wk1
## 6 2000 98^0 Give… 03:24 2000-08-19 wk1
## 7 2000 A*Tee… Danc… 03:44 2000-07-08 wk1
## 8 2000 Aaliy… I Do… 04:15 2000-01-29 wk1
## 9 2000 Aaliy… Try … 04:03 2000-03-18 wk1
## 10 2000 Adams… Open… 05:30 2000-08-26 wk1
## # … with 5,297 more rows, and 1 more variable:
## # rank <dbl>
Let’s sort this table by track to see a problem with this table:
## # A tibble: 5,307 x 7
## year artist track time date.entered week
## <dbl> <chr> <chr> <tim> <date> <chr>
## 1 2000 Nelly (Hot… 04:17 2000-04-29 wk1
## 2 2000 Nelly (Hot… 04:17 2000-04-29 wk2
## 3 2000 Nelly (Hot… 04:17 2000-04-29 wk3
## 4 2000 Nelly (Hot… 04:17 2000-04-29 wk4
## 5 2000 Nelly (Hot… 04:17 2000-04-29 wk5
## 6 2000 Nelly (Hot… 04:17 2000-04-29 wk6
## 7 2000 Nelly (Hot… 04:17 2000-04-29 wk7
## 8 2000 Nelly (Hot… 04:17 2000-04-29 wk8
## 9 2000 Nelly (Hot… 04:17 2000-04-29 wk9
## 10 2000 Nelly (Hot… 04:17 2000-04-29 wk10
## # … with 5,297 more rows, and 1 more variable:
## # rank <dbl>
We have a lot of repeated information in many of these rows (the artist, track name, year, title and date entered). The problem is that this table contains information about both tracks and rank in billboard. That’s two different kinds of observations that should belong in two different tables in a tidy dataset.
Let’s make a song table that only includes information about songs:
## # A tibble: 317 x 5
## artist track year time date.entered
## <chr> <chr> <dbl> <tim> <date>
## 1 Nelly (Hot S**t)… 2000 04:17 2000-04-29
## 2 Nu Flavor 3 Little W… 2000 03:54 2000-06-03
## 3 Jean, Wy… 911 2000 04:00 2000-10-07
## 4 Brock, C… A Country … 2000 03:54 2000-01-01
## 5 Clark, T… A Little G… 2000 03:07 2000-12-16
## 6 Son By F… A Puro Dol… 2000 03:30 2000-04-08
## 7 Carter, … Aaron's Pa… 2000 03:23 2000-08-26
## 8 Nine Days Absolutely… 2000 03:09 2000-05-06
## 9 De La So… All Good? 2000 05:02 2000-12-23
## 10 Blink-182 All The Sm… 2000 02:52 1999-12-04
## # … with 307 more rows
The unique
function removes any duplicate rows in a table. That’s how we have a single row for each song.
Next, we would like to remove all the song information from the rank table. But we need to do it in a way that still remembers which song each ranking observation corresponds to. To do that, let’s first give each song an identifier that we can use to link songs and rankings. So, we can produce the final version of our song table like this:
song <- tidy_billboard %>%
dplyr::select(artist, track, year, time, date.entered) %>%
unique() %>%
mutate(song_id = row_number())
song
## # A tibble: 317 x 6
## artist track year time date.entered song_id
## <chr> <chr> <dbl> <tim> <date> <int>
## 1 Nelly (Hot… 2000 04:17 2000-04-29 1
## 2 Nu Fla… 3 Li… 2000 03:54 2000-06-03 2
## 3 Jean, … 911 2000 04:00 2000-10-07 3
## 4 Brock,… A Co… 2000 03:54 2000-01-01 4
## 5 Clark,… A Li… 2000 03:07 2000-12-16 5
## 6 Son By… A Pu… 2000 03:30 2000-04-08 6
## 7 Carter… Aaro… 2000 03:23 2000-08-26 7
## 8 Nine D… Abso… 2000 03:09 2000-05-06 8
## 9 De La … All … 2000 05:02 2000-12-23 9
## 10 Blink-… All … 2000 02:52 1999-12-04 10
## # … with 307 more rows
The mutate
function adds a new column to the table, in this case with column name song_id
and value the row number the song appears in the table (from the row_number
column).
Now we can make a rank table, we combine the tidy billboard table with our new song table using a join
(we’ll learn all about joins later). It checks the values on each row of the billboard table and looks for rows in the song table that have the exact same values, and makes a new row that combines the information from both tables.
## # A tibble: 5,307 x 8
## year artist track time date.entered week
## <dbl> <chr> <chr> <tim> <date> <chr>
## 1 2000 Nelly (Hot… 04:17 2000-04-29 wk1
## 2 2000 Nelly (Hot… 04:17 2000-04-29 wk2
## 3 2000 Nelly (Hot… 04:17 2000-04-29 wk3
## 4 2000 Nelly (Hot… 04:17 2000-04-29 wk4
## 5 2000 Nelly (Hot… 04:17 2000-04-29 wk5
## 6 2000 Nelly (Hot… 04:17 2000-04-29 wk6
## 7 2000 Nelly (Hot… 04:17 2000-04-29 wk7
## 8 2000 Nelly (Hot… 04:17 2000-04-29 wk8
## 9 2000 Nelly (Hot… 04:17 2000-04-29 wk9
## 10 2000 Nelly (Hot… 04:17 2000-04-29 wk10
## # … with 5,297 more rows, and 2 more variables:
## # rank <dbl>, song_id <int>
That adds the song_id
variable to the tidy_billboard
table. So now we can remove the song information and only keep ranking information and the song_id
.
rank <- tidy_billboard %>%
left_join(song, c("artist", "year", "track", "time", "date.entered")) %>%
dplyr::select(song_id, week, rank)
rank
## # A tibble: 5,307 x 3
## song_id week rank
## <int> <chr> <dbl>
## 1 1 wk1 100
## 2 1 wk2 99
## 3 1 wk3 96
## 4 1 wk4 76
## 5 1 wk5 55
## 6 1 wk6 37
## 7 1 wk7 24
## 8 1 wk8 24
## 9 1 wk9 30
## 10 1 wk10 36
## # … with 5,297 more rows
Challenge: Let’s do a little better job at tidying the billboard dataset:
- When using
gather
to make theweek
andrank
columns, remove any weeks where the song does not appear in the top 100. This is coded as missing (NA
). See thena.rm
argument togather
.
- Make
week
a numeric variable (i.e., removewk
). See what theextract_numeric
function does.
- Instead of
date.entered
add adate
column that states the actual date of each ranking. See how R deals with dates?Date
and how you can turn a string into aDate
usingas.Date
.
- Sort the resulting table by date and rank.
- Make new
song
andrank
tables.song
will now not have thedate.entered
column, andrank
will have the newdate
column you have just created.