18 Text and Dates

In this chapter we briefly discuss common patterns for handling text and date data and point to useful resources.

18.1 Text

Frequently, scraped or ingested data will contain text that needs to be processed to extract data, correct errors, or resolve duplicate records. In this section we look at a few common patterns: 1) tools for string operations, 2) tools using regular expressions, and 3) deriving attributes from text. For further reading, consult http://r4ds.had.co.nz/strings.html.

18.1.1 String operations

The stringr package provides a number of commonly used string manipulation operations.

Here are a few common ones, with example output; a sketch of calls that could produce this output follows the list:

  • string length: str_length
## [1] 13 53
  • combining strings: str_c
## [1] "I love Spring. There's is nothing I love more than 320 in the Spring"
  • subsetting strings: str_sub
## [1] " lov" "here"
  • trim strings: str_trim
## [1] "I am padded"

18.1.2 Regular expressions

By far, the most powerful tools for extracting and cleaning text data are regular expressions. The stringr package provides a great number of tools based on regular expression matching.

First, some basics:

  • Match any character: .
  • Match the ‘dot’ character: \\.
  • Anchor start (^), end ($)
  • Character classes and alternatives

  • \d: match any digit
  • \s: match any whitespace (e.g., space, tab, newline)
  • [abc]: match a set of characters (e.g., a, b, or c)
  • [^abc]: match anything except this set of characters
  • |: alternation; match one pattern or another

For example, the pattern [aeiou]|\d matches vowels or digits.

  • Repetition

  • ?: zero or one
  • +: one or more
  • *: zero or more

  • Grouping and backreferences

Parentheses define groups, which can be referenced using \1, \2, etc.
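A short sketch illustrating these patterns with str_detect (the input vector is made up for illustration):

```r
library(stringr)

x <- c("cat", "c.t", "a1b", "banana")

str_detect(x, "c.t")          # . matches any character: "cat" and "c.t"
str_detect(x, "c\\.t")        # \\. matches a literal dot: only "c.t"
str_detect(x, "^a")           # anchored at the start: "a1b"
str_detect(x, "\\d")          # contains a digit: "a1b"
str_detect(x, "[aeiou]|\\d")  # vowels or digits: all but "c.t"
str_detect(x, "na+")          # repetition: "n" followed by one or more "a"
str_detect(x, "(an)\\1")      # backreference: "anan" inside "banana"
```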

18.1.3 Tools using regular expressions

  • Determine which strings match a pattern: str_detect. Given a vector of strings, it returns TRUE for those that match a regular expression and FALSE otherwise.
## [1] "a"        "able"     "about"    "absolute"
## [5] "accept"   "account"
## # A tibble: 30 x 2
##    word     result
##    <chr>    <lgl> 
##  1 as       TRUE  
##  2 bottom   FALSE 
##  3 letter   FALSE 
##  4 here     FALSE 
##  5 thousand FALSE 
##  6 love     FALSE 
##  7 town     FALSE 
##  8 keep     FALSE 
##  9 actual   TRUE  
## 10 front    FALSE 
## # … with 20 more rows
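A sketch of calls that could produce output like the above, using the words vector bundled with stringr (the sample of 30 words is an assumption):

```r
library(stringr)
library(tibble)
library(dplyr)

# Words that start with "a"
head(words[str_detect(words, "^a")])

# The same test recorded as a logical column of a data frame
tibble(word = sample(words, 30)) %>%
  mutate(result = str_detect(word, "^a"))
```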

Similarly, str_count returns the number of matches in a string instead of just TRUE or FALSE.
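For instance, counting occurrences of "a" in made-up inputs:

```r
str_count(c("banana", "apple"), "a")  # returns 3 1
```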

  • Filter string vectors to include only those that match a regular expression: str_subset
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."       
## [6] "The juice of lemons makes fine punch."
## [1] "red|orange|yellow|green|blue|purple"
  • Extracting matches: str_extract, str_extract_all
##  [1] "blue"   "blue"   "red"    "red"    "red"   
##  [6] "blue"   "yellow" "red"    "red"    "green"
  • Grouped matches: str_match
##       [,1]         [,2]  [,3]     
##  [1,] "the smooth" "the" "smooth" 
##  [2,] "the sheet"  "the" "sheet"  
##  [3,] "the depth"  "the" "depth"  
##  [4,] "a chicken"  "a"   "chicken"
##  [5,] "the parked" "the" "parked" 
##  [6,] "the sun"    "the" "sun"    
##  [7,] "the huge"   "the" "huge"   
##  [8,] "the ball"   "the" "ball"   
##  [9,] "the woman"  "the" "woman"  
## [10,] "a helps"    "a"   "helps"

The result is a character matrix with one row for each string in the input vector. The first column contains the complete match to the regular expression (just like str_extract); the remaining columns contain the matches for the groups defined in the pattern. To extract the matches for a particular group, index the corresponding column. For example, the matches for the second group are:

##  [1] "smooth"  "sheet"   "depth"   "chicken"
##  [5] "parked"  "sun"     "huge"    "ball"   
##  [9] "woman"   "helps"
  • Splitting strings: str_split splits the strings in a vector based on a match. For instance, to split sentences into words:
## [[1]]
## [1] "The"     "birch"   "canoe"   "slid"   
## [5] "on"      "the"     "smooth"  "planks."
## 
## [[2]]
## [1] "Glue"        "the"         "sheet"      
## [4] "to"          "the"         "dark"       
## [7] "blue"        "background."
## 
## [[3]]
## [1] "It's"  "easy"  "to"    "tell"  "the"  
## [6] "depth" "of"    "a"     "well."
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken"
## [5] "leg"     "is"      "a"       "rare"   
## [9] "dish."  
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"    
## [6] "round"  "bowls."

18.1.4 Extracting attributes from text

Handling free text in data pipelines and statistical models is tricky. Frequently, we extract attributes from text in order to perform analysis.

We draw from https://www.tidytextmining.com/tidytext.html for this discussion.

We usually think of a text dataset (called a text corpus) in terms of:

  • documents: the instances of free text in our dataset, and

  • terms: the specific tokens, e.g., words, they contain.

In terms of the representation models we have used so far, we can think of documents as entities, described by attributes based on words, or words as entities, described by attributes based on documents. To tidy text data, we tend to create one-token-per-row data frames that list the instances of terms in documents in a dataset.

Here’s a simple example using the text of Jane Austen’s novels.
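A sketch of the preparation step, following the tidytext book (the janeaustenr package provides the text; linenumber and chapter are derived columns):

```r
library(janeaustenr)
library(dplyr)
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
           regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()

original_books
```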

## # A tibble: 73,422 x 4
##    text           book         linenumber chapter
##    <chr>          <fct>             <int>   <int>
##  1 "SENSE AND SE… Sense & Sen…          1       0
##  2 ""             Sense & Sen…          2       0
##  3 "by Jane Aust… Sense & Sen…          3       0
##  4 ""             Sense & Sen…          4       0
##  5 "(1811)"       Sense & Sen…          5       0
##  6 ""             Sense & Sen…          6       0
##  7 ""             Sense & Sen…          7       0
##  8 ""             Sense & Sen…          8       0
##  9 ""             Sense & Sen…          9       0
## 10 "CHAPTER 1"    Sense & Sen…         10       1
## # … with 73,412 more rows

Let’s re-structure it as a one-token-per-row data frame using the unnest_tokens function in the tidytext package.
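A minimal sketch:

```r
library(tidytext)

tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
```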

## # A tibble: 725,055 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Sense & Sensibil…          1       0 sense    
##  2 Sense & Sensibil…          1       0 and      
##  3 Sense & Sensibil…          1       0 sensibil…
##  4 Sense & Sensibil…          3       0 by       
##  5 Sense & Sensibil…          3       0 jane     
##  6 Sense & Sensibil…          3       0 austen   
##  7 Sense & Sensibil…          5       0 1811     
##  8 Sense & Sensibil…         10       1 chapter  
##  9 Sense & Sensibil…         10       1 1        
## 10 Sense & Sensibil…         13       1 the      
## # … with 725,045 more rows

Let’s remove stop words from the data frame
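A sketch using the stop_words data frame bundled with tidytext:

```r
data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words, by = "word")

tidy_books
```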

## # A tibble: 217,609 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Sense & Sensibil…          1       0 sense    
##  2 Sense & Sensibil…          1       0 sensibil…
##  3 Sense & Sensibil…          3       0 jane     
##  4 Sense & Sensibil…          3       0 austen   
##  5 Sense & Sensibil…          5       0 1811     
##  6 Sense & Sensibil…         10       1 chapter  
##  7 Sense & Sensibil…         10       1 1        
##  8 Sense & Sensibil…         13       1 family   
##  9 Sense & Sensibil…         13       1 dashwood 
## 10 Sense & Sensibil…         13       1 settled  
## # … with 217,599 more rows

Now we can use this dataset to compute attributes for entities of interest. For instance, let’s create a data frame with words as entities, with an attribute containing the number of times each word appears in this corpus.
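A sketch using count:

```r
word_counts <- tidy_books %>%
  count(word, sort = TRUE)

word_counts
```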

This data frame can then be used like any other we have worked with previously. For example, to plot the most frequent words:
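A sketch of such a plot (the cutoff of 600 appearances is an arbitrary choice):

```r
library(ggplot2)

word_counts %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
    geom_col() +
    coord_flip()
```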

18.2 Handling dates

The lubridate package provides common operations for parsing and operating on dates and times. See http://r4ds.had.co.nz/dates-and-times.html for more information.

A number of functions for parsing dates in a variety of formats are provided, along with functions to extract specific components from parsed date objects:
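A sketch consistent with the output below (the specific datetime is an assumption; it parses to July 8, 2016):

```r
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")

year(datetime)   # 2016
month(datetime)  # 7
mday(datetime)   # day of the month: 8
day(datetime)    # alias for mday(): 8
yday(datetime)   # day of the year: 190
wday(datetime)   # day of the week (Sunday is 1): 6
```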

## [1] 2016
## [1] 7
## [1] 8
## [1] 8
## [1] 190
## [1] 6

They can also return month and day-of-the-week names, abbreviated, as ordered factors:
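For example:

```r
month(datetime, label = TRUE)
```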

## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < ... < Dec

We can also create attributes of type datetime from other attributes. Here’s an example using the flights dataset:
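A sketch following r4ds: dep_time encodes clock time as HHMM, so we split it into hours and minutes with integer division and build a datetime with make_datetime (the name flights_dt is our own):

```r
library(nycflights13)
library(dplyr)
library(lubridate)

flights_dt <- flights %>%
  select(year, month, day, dep_time) %>%
  mutate(dep_dt = make_datetime(year, month, day,
                                dep_time %/% 100, dep_time %% 100))

flights_dt
```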

## # A tibble: 336,776 x 5
##     year month   day dep_time dep_dt             
##    <int> <int> <int>    <int> <dttm>             
##  1  2013     1     1      517 2013-01-01 05:17:00
##  2  2013     1     1      533 2013-01-01 05:33:00
##  3  2013     1     1      542 2013-01-01 05:42:00
##  4  2013     1     1      544 2013-01-01 05:44:00
##  5  2013     1     1      554 2013-01-01 05:54:00
##  6  2013     1     1      554 2013-01-01 05:54:00
##  7  2013     1     1      555 2013-01-01 05:55:00
##  8  2013     1     1      557 2013-01-01 05:57:00
##  9  2013     1     1      557 2013-01-01 05:57:00
## 10  2013     1     1      558 2013-01-01 05:58:00
## # … with 336,766 more rows

With this attribute in place, we can extract the day of the week and plot the number of flights per day of the week:
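A sketch (dropping flights with a missing departure time):

```r
library(ggplot2)

flights_dt %>%
  filter(!is.na(dep_dt)) %>%
  mutate(weekday = wday(dep_dt, label = TRUE)) %>%
  ggplot(aes(x = weekday)) +
    geom_bar()
```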