13 Two-table operations
So far we have looked at data operations defined over single tables and data frames. In this section we look at efficient methods to combine data from multiple tables. The fundamental operation here is the join
, which is a workhorse of database system design and impementation. The join
operation combines rows from two tables to create a new single table, based on matching criteria specified over attributes of each of the two tables.
Consider the example of joining the flights
and airlines
table:
Let’s take a look at the flights
table again:
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## 7 2013 1 1 555 600
## 8 2013 1 1 557 600
## 9 2013 1 1 557 600
## 10 2013 1 1 558 600
## # … with 336,766 more rows, and 14 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
And add the airlines
table:
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
Here, we want to add airline information to each flight. We can do so by joining the attributes of the respective airline from the airlines
table with the flights
table based on the values of attributes flights$carrier
and airlines$carrier
.
Specifically, every row of flights
with a specific value for flights$carrier
, is joined with the the corresponding row in airlines
with the same value for airlines$carrier
. We will see four different ways of performing this operation that differ on how non-matching observations are handled.
13.1 Left Join
In a left join
, all observations on left operand (LHS) are retained:
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## 7 2013 1 1 555 600
## 8 2013 1 1 557 600
## 9 2013 1 1 557 600
## 10 2013 1 1 558 600
## # … with 336,766 more rows, and 15 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, name <chr>
RHS variables for LHS observations with no matching RHS observations are coded as NA
.
13.2 Right Join
All observations on right operand (RHS) are retained:
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 810 810
## 2 2013 1 1 1451 1500
## 3 2013 1 1 1452 1455
## 4 2013 1 1 1454 1500
## 5 2013 1 1 1507 1515
## 6 2013 1 1 1530 1530
## 7 2013 1 1 1546 1540
## 8 2013 1 1 1550 1550
## 9 2013 1 1 1552 1600
## 10 2013 1 1 1554 1600
## # … with 336,766 more rows, and 15 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, name <chr>
LHS variables for RHS observations with no matching LHS observations are coded as NA
.
13.3 Inner Join
Only observations matching on both tables are retained
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## 7 2013 1 1 555 600
## 8 2013 1 1 557 600
## 9 2013 1 1 557 600
## 10 2013 1 1 558 600
## # … with 336,766 more rows, and 15 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, name <chr>
13.4 Full Join
All observations are retained, regardless of matching condition
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515
## 2 2013 1 1 533 529
## 3 2013 1 1 542 540
## 4 2013 1 1 544 545
## 5 2013 1 1 554 600
## 6 2013 1 1 554 558
## 7 2013 1 1 555 600
## 8 2013 1 1 557 600
## 9 2013 1 1 557 600
## 10 2013 1 1 558 600
## # … with 336,766 more rows, and 15 more
## # variables: dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, name <chr>
All values coded as NA
for non-matching observations as appropriate.
13.5 Join conditions
All join operations are based on a matching condition:
specifies to join observations where flights$carrier
equals airlines$carrier
.
In this case, where no conditions are specified using the by
argument:
a natural join is perfomed. In this case all variables with the same name in both tables are used in join condition.
You can also specify join conditions on arbitrary attributes using the by
argument.
13.6 Filtering Joins
We’ve just seen mutating joins that create new tables. Filtering joins use join conditions to filter a specific table.
## # A tibble: 0 x 19
## # … with 19 variables: year <int>, month <int>,
## # day <int>, dep_time <int>,
## # sched_dep_time <int>, dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Filters the flights
table to only include flights from airlines that
are not included in the airlines
table.
13.7 SQL Constructs: Multi-table Queries
Key idea: - Do a join to get an appropriate table - Use the constructs for single-table queries
You will get used to doing all at once.
For the first part, where we use a join to get an appropriate table, the general SQL construct includes: - The name of the first table to join - The type of join to do - The name of the second table to join - The join condition(s)
- Examples:
select title, year, me.name as producerName
from movies m join movieexec me
where m.producer = me.id;
- Consider the query:
select title, year, producer, count(starName)
from movies join starsIn
where title = starsIn.movieTitle and year = starsIn.movieYear
group by title, year, producer
- What about movies with no stars ?
- Need to use outer joins
select title, year, producer, count(starName)
from movies left outer join starsIn
on title = starsIn.movieTitle and year = starsIn.movieYear
group by title, year, producer
As we saw before, all tuples from ‘movies’ that have no matches in starsIn are included with NULLs (in dplyr
this was NA
).
So, if a tuple (m1, 1990)
has no match in starsIn
, we get (m1, 1990, NULL)
in the result and the count(starName)
works correctly then. Note however that count(*)
would not work correctly (NULLs can have unintuitive behavior)
In most systems JOIN
corresponds to an inner join, and include LEFT JOIN
and RIGHT JOIN
as well.