12 SQL I: Single Table Queries
The Structured-Query-Language (SQL) is the predominant language used in database systems. It is tailored to the Relational data representation model.
SQL is a declarative language, we don’t write a procedure to compute a relation, we declare what the relation we want to compute looks like.
The actual execution is determined and optimized by the database engine. However, there are clear mappings between
parts of SQL queries and the operations we have defined so far as implemented in the tidyverse
.
The basic construct in SQL is the so-called SFW
construct: select-from-where which specifies:
- select: which attributes you want the answer to have
- from: which relation (table) you want the answer to be computed from
- where: what conditions you want to be satisfied by the rows (tuples) of the answer
E.g.: movies produced by disney in 1990: note the rename
The select clause can contain expressions (this is paralleled by the mutate
operation we saw previously)
select title || ' (' || to_char(year) || ')' as titleyear
select 2014 - year
The where clause support a large number of different predicates and combinations thereof (this is parallel to the filter
operation)
year between 1990 and 1995
title like 'star wars%'
title like 'star wars _'
We can include ordering, e.g., find distinct movies sorted by title
12.1 Group-by and summarize
SQL has an idiom for grouping and summarizing (conditioning as we called it before). Remember this is a very important concept that shows up in many data processing platforms
- What it does: Partition the tuples by the group attributes (year in this case), and do something (compute avg in this case) for each group
- Number of resulting tuples == Number of groups
E.g., compute the average movie length by year
12.2 Subqueries
You can nest queries as an expression in an SFW query. We refer to these “subqueries” as “nested subquery”:
E.g., find movie with the maximum length
E.g., find movies with at least 5 stars: an example of a correlated subquery
select *
from movies m
where 5 >= (select count(*)
from starsIn si
where si.title = m.title and si.year = m.year);
The nested subquery counts the number of actors for that movie.
E.g., rank movies by their length.
select title, year, (select count(*)
from movies m2
where m1.length <= m2.length) as rank
from movies m1;
Key insight: A movie is ranked 5th if there are exactly 4 movies with longer length. Most database systems support some sort of a rank keyword for doing this. Notice that the above query doesn’t work in presence of ties etc.