Here we show SQL constructs using the Lahman baseball dataset (downloaded from https://github.com/jknecht/baseball-archive-sqlite). We also show how to use a SQL database inside Rmarkdown.
First, we create a connection to the database. In this case, we are using a SQLite
database. A good system to prototype database designs. To make the most of a database system, one would use some of the more powerful products: Oracle, Microsoft SQLServer, MySQL (MariaDB), PostgreSQL or other. In all cases, the way to create a connection to the server from Rmarkdown is the same.
library(DBI)
db <- dbConnect(RSQLite::SQLite(), dbname="../lecture-notes/data/lahman2016.sqlite")
First, we write a query to get batting statistics for Washington Nationals in 2010: note the rename. Note that the chunk in the markdown file is declared as {sql, connection=db}
in the Rmarkdown file:
SELECT b.playerId, b.yearId, b.H, b.AB
FROM BATTING AS b
WHERE teamID = 'WAS' AND yearID = 2010
playerID | yearID | H | AB |
---|---|---|---|
atilalu01 | 2010 | 1 | 25 |
balesco01 | 2010 | 0 | 0 |
batismi01 | 2010 | 1 | 8 |
bergmja01 | 2010 | 0 | 0 |
bernaro01 | 2010 | 102 | 414 |
bisenjo01 | 2010 | 0 | 0 |
brunebr01 | 2010 | 0 | 0 |
burkeja02 | 2010 | 0 | 0 |
burnese01 | 2010 | 0 | 0 |
cappsma01 | 2010 | 0 | 1 |
The select clause can contain expressions (this is paralleled by the mutate
operation we saw previously)
SELECT b.playerId, b.yearId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
playerID | yearID | AB | BP |
---|---|---|---|
aardsda01 | 2004 | 0 | NA |
aardsda01 | 2006 | 2 | 0.0000000 |
aardsda01 | 2007 | 0 | NA |
aardsda01 | 2008 | 1 | 0.0000000 |
aardsda01 | 2009 | 0 | NA |
aardsda01 | 2010 | 0 | NA |
aardsda01 | 2012 | 0 | NA |
aardsda01 | 2013 | 0 | NA |
aardsda01 | 2015 | 1 | 0.0000000 |
aaronha01 | 1954 | 468 | 0.2799145 |
The where clause support a large number of different predicates and combinations thereof (this is parallel to the filter
operation)
SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
WHERE b.AB > 0 AND
b.yearID > 2000 AND
b.yearID < 2010 AND
b.teamID LIKE 'NY%'
playerID | yearID | teamID | AB | BP |
---|---|---|---|---|
abreubo01 | 2006 | NYA | 209 | 0.3301435 |
abreubo01 | 2007 | NYA | 605 | 0.2826446 |
abreubo01 | 2008 | NYA | 609 | 0.2955665 |
aceveal01 | 2009 | NYA | 2 | 0.0000000 |
agbaybe01 | 2001 | NYN | 296 | 0.2770270 |
aguilch01 | 2008 | NYN | 12 | 0.1666667 |
alfoned01 | 2001 | NYN | 457 | 0.2428884 |
alfoned01 | 2002 | NYN | 490 | 0.3081633 |
almoned01 | 2003 | NYN | 1 | 0.0000000 |
almoner01 | 2001 | NYA | 4 | 0.5000000 |
We can include ordering (parallel to arrange
)
SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
WHERE b.AB > 0 AND
b.yearID > 2000 AND
b.yearID < 2010 AND
b.teamID LIKE 'NY%'
ORDER BY b.AB DESC, BP DESC;
playerID | yearID | teamID | AB | BP |
---|---|---|---|---|
soriaal01 | 2002 | NYA | 696 | 0.3002874 |
reyesjo01 | 2005 | NYN | 696 | 0.2729885 |
reyesjo01 | 2008 | NYN | 688 | 0.2965116 |
soriaal01 | 2003 | NYA | 682 | 0.2903226 |
reyesjo01 | 2007 | NYN | 681 | 0.2804699 |
jeterde01 | 2005 | NYA | 654 | 0.3088685 |
reyesjo01 | 2006 | NYN | 647 | 0.2998454 |
jeterde01 | 2002 | NYA | 644 | 0.2965839 |
jeterde01 | 2004 | NYA | 643 | 0.2923795 |
jeterde01 | 2007 | NYA | 639 | 0.3223787 |
SELECT b.teamId, b.yearId, avg(1.0 * b.H / b.AB) AS AVE_BP
FROM BATTING AS b
WHERE b.AB > 0 AND
b.yearID > 2000 AND
b.yearID < 2010
GROUP BY b.teamId, b.yearId
ORDER BY AVE_BP DESC
teamID | yearID | AVE_BP |
---|---|---|
TBA | 2007 | 0.2892067 |
MIN | 2008 | 0.2875753 |
SEA | 2007 | 0.2845483 |
SLN | 2003 | 0.2832984 |
MIN | 2001 | 0.2791984 |
SEA | 2002 | 0.2738639 |
CLE | 2006 | 0.2728260 |
TOR | 2003 | 0.2648080 |
KCA | 2008 | 0.2633696 |
CHA | 2004 | 0.2603735 |
Batting %>%
filter(AB > 0, ...) %>%
group_by(teamId, yearId) %>%
summarize(ave_bp = mean(1.0 * H / AB))
Sometimes it’s easier to nest queries like the one above into query and subquery
SELECT teamID, yearID, avg(BP) AS AVG_BP
FROM (SELECT b.teamId, b.yearId, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
WHERE b.AB > 0 AND
b.yearID > 2000 AND
b.yearID < 2010)
GROUP BY teamID, yearID
ORDER BY AVG_BP DESC;
teamID | yearID | AVG_BP |
---|---|---|
TBA | 2007 | 0.2892067 |
MIN | 2008 | 0.2875753 |
SEA | 2007 | 0.2845483 |
SLN | 2003 | 0.2832984 |
MIN | 2001 | 0.2791984 |
SEA | 2002 | 0.2738639 |
CLE | 2006 | 0.2728260 |
TOR | 2003 | 0.2648080 |
KCA | 2008 | 0.2633696 |
CHA | 2004 | 0.2603735 |
List all players from California, playing in 2015
library(Lahman)
library(tidyverse)
## ── Attaching packages ───────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ──────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Batting %>%
inner_join(Master, by="playerID") %>%
filter(yearID == "2015", birthState == "CA")
select b.playerId, b.teamId, m.birthState
from Batting as b join master as m on b.playerId == m.playerId
where yearId = "2015" and m.birthState = "CA"
playerID | teamID | birthState |
---|---|---|
alexasc01 | KCA | CA |
anderbr05 | OAK | CA |
anderco01 | CLE | CA |
andrima01 | TBA | CA |
arenano01 | COL | CA |
axelrdy01 | CIN | CA |
balesco01 | CIN | CA |
bandyje01 | LAA | CA |
barneau01 | LAN | CA |
barnebr02 | COL | CA |
Finally, we close the connection to the database:
dbDisconnect(db)