12 SQL I: Single Table Queries

The Structured-Query-Language (SQL) is the predominant language used in database systems. It is tailored to the Relational data representation model.

SQL is a declarative language, we don’t write a procedure to compute a relation, we declare what the relation we want to compute looks like. The actual execution is determined and optimized by the database engine. However, there are clear mappings between parts of SQL queries and the operations we have defined so far as implemented in the tidyverse.

The basic construct in SQL is the so-called SFW construct: select-from-where which specifies:

  • select: which attributes you want the answer to have
  • from: which relation (table) you want the answer to be computed from
  • where: what conditions you want to be satisfied by the rows (tuples) of the answer

E.g.: movies produced by disney in 1990: note the rename

The select clause can contain expressions (this is paralleled by the mutate operation we saw previously)

  • select title || ' (' || to_char(year) || ')' as titleyear
  • select 2014 - year

The where clause support a large number of different predicates and combinations thereof (this is parallel to the filter operation)

  • year between 1990 and 1995
  • title like 'star wars%' title like 'star wars _'

We can include ordering, e.g., find distinct movies sorted by title

12.1 Group-by and summarize

SQL has an idiom for grouping and summarizing (conditioning as we called it before). Remember this is a very important concept that shows up in many data processing platforms

  • What it does: Partition the tuples by the group attributes (year in this case), and do something (compute avg in this case) for each group
  • Number of resulting tuples == Number of groups

E.g., compute the average movie length by year

12.2 Subqueries

You can nest queries as an expression in an SFW query. We refer to these “subqueries” as “nested subquery”:

E.g., find movie with the maximum length

E.g., find movies with at least 5 stars: an example of a correlated subquery

The nested subquery counts the number of actors for that movie.

E.g., rank movies by their length.

Key insight: A movie is ranked 5th if there are exactly 4 movies with longer length. Most database systems support some sort of a rank keyword for doing this. Notice that the above query doesn’t work in presence of ties etc.