Here we show SQL constructs using the Lahman baseball dataset (downloaded from https://github.com/jknecht/baseball-archive-sqlite). We also show how to use a SQL database inside Rmarkdown.

First, we create a connection to the database. In this case, we are using a SQLite database. A good system to prototype database designs. To make the most of a database system, one would use some of the more powerful products: Oracle, Microsoft SQLServer, MySQL (MariaDB), PostgreSQL or other. In all cases, the way to create a connection to the server from Rmarkdown is the same.

library(DBI)
db <- dbConnect(RSQLite::SQLite(), dbname="../lecture-notes/data/lahman2016.sqlite")

Select-From-Where

First, we write a query to get batting statistics for Washington Nationals in 2010: note the rename. Note that the chunk in the markdown file is declared as {sql, connection=db} in the Rmarkdown file:

SELECT b.playerId, b.yearId, b.H, b.AB
FROM BATTING AS b
WHERE teamID = 'WAS' AND yearID = 2010
Displaying records 1 - 10
playerID yearID H AB
atilalu01 2010 1 25
balesco01 2010 0 0
batismi01 2010 1 8
bergmja01 2010 0 0
bernaro01 2010 102 414
bisenjo01 2010 0 0
brunebr01 2010 0 0
burkeja02 2010 0 0
burnese01 2010 0 0
cappsma01 2010 0 1

Expressions

The select clause can contain expressions (this is paralleled by the mutate operation we saw previously)

SELECT b.playerId, b.yearId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
Displaying records 1 - 10
playerID yearID AB BP
aardsda01 2004 0 NA
aardsda01 2006 2 0.0000000
aardsda01 2007 0 NA
aardsda01 2008 1 0.0000000
aardsda01 2009 0 NA
aardsda01 2010 0 NA
aardsda01 2012 0 NA
aardsda01 2013 0 NA
aardsda01 2015 1 0.0000000
aaronha01 1954 468 0.2799145

WHERE predicates

The where clause support a large number of different predicates and combinations thereof (this is parallel to the filter operation)

SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
WHERE b.AB > 0 AND
  b.yearID > 2000 AND
  b.yearID < 2010 AND 
  b.teamID LIKE 'NY%'
Displaying records 1 - 10
playerID yearID teamID AB BP
abreubo01 2006 NYA 209 0.3301435
abreubo01 2007 NYA 605 0.2826446
abreubo01 2008 NYA 609 0.2955665
aceveal01 2009 NYA 2 0.0000000
agbaybe01 2001 NYN 296 0.2770270
aguilch01 2008 NYN 12 0.1666667
alfoned01 2001 NYN 457 0.2428884
alfoned01 2002 NYN 490 0.3081633
almoned01 2003 NYN 1 0.0000000
almoner01 2001 NYA 4 0.5000000

ORDERING

We can include ordering (parallel to arrange)

SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
WHERE b.AB > 0 AND
  b.yearID > 2000 AND
  b.yearID < 2010 AND 
  b.teamID LIKE 'NY%'
ORDER BY b.AB DESC, BP DESC;
Displaying records 1 - 10
playerID yearID teamID AB BP
soriaal01 2002 NYA 696 0.3002874
reyesjo01 2005 NYN 696 0.2729885
reyesjo01 2008 NYN 688 0.2965116
soriaal01 2003 NYA 682 0.2903226
reyesjo01 2007 NYN 681 0.2804699
jeterde01 2005 NYA 654 0.3088685
reyesjo01 2006 NYN 647 0.2998454
jeterde01 2002 NYA 644 0.2965839
jeterde01 2004 NYA 643 0.2923795
jeterde01 2007 NYA 639 0.3223787

Group_by and Summarize

SELECT b.teamId, b.yearId, avg(1.0 * b.H / b.AB) AS AVE_BP
FROM BATTING AS b
WHERE b.AB > 0 AND
  b.yearID > 2000 AND
  b.yearID < 2010
GROUP BY b.teamId, b.yearId
ORDER BY AVE_BP DESC
Displaying records 1 - 10
teamID yearID AVE_BP
TBA 2007 0.2892067
MIN 2008 0.2875753
SEA 2007 0.2845483
SLN 2003 0.2832984
MIN 2001 0.2791984
SEA 2002 0.2738639
CLE 2006 0.2728260
TOR 2003 0.2648080
KCA 2008 0.2633696
CHA 2004 0.2603735
Batting %>%
  filter(AB > 0, ...) %>%
  group_by(teamId, yearId) %>%
  summarize(ave_bp = mean(1.0 * H / AB))

Subqueries

Sometimes it’s easier to nest queries like the one above into query and subquery

SELECT teamID, yearID, avg(BP) AS AVG_BP
FROM (SELECT b.teamId, b.yearId, 1.0 * b.H / b.AB AS BP
      FROM BATTING AS b
      WHERE b.AB > 0 AND
        b.yearID > 2000 AND
        b.yearID < 2010)
GROUP BY teamID, yearID
ORDER BY AVG_BP DESC;
Displaying records 1 - 10
teamID yearID AVG_BP
TBA 2007 0.2892067
MIN 2008 0.2875753
SEA 2007 0.2845483
SLN 2003 0.2832984
MIN 2001 0.2791984
SEA 2002 0.2738639
CLE 2006 0.2728260
TOR 2003 0.2648080
KCA 2008 0.2633696
CHA 2004 0.2603735

Joins

List all players from California, playing in 2015

library(Lahman)
library(tidyverse)
## ── Attaching packages ───────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ──────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
Batting %>%
  inner_join(Master, by="playerID") %>%
  filter(yearID == "2015", birthState == "CA")
select b.playerId, b.teamId, m.birthState
from Batting as b join master as m on b.playerId == m.playerId
where yearId = "2015" and m.birthState = "CA"
Displaying records 1 - 10
playerID teamID birthState
alexasc01 KCA CA
anderbr05 OAK CA
anderco01 CLE CA
andrima01 TBA CA
arenano01 COL CA
axelrdy01 CIN CA
balesco01 CIN CA
bandyje01 LAA CA
barneau01 LAN CA
barnebr02 COL CA

Finally, we close the connection to the database:

dbDisconnect(db)