SQL Examples with Baseball

Here we show SQL constructs using the Lahman baseball dataset (downloaded from https://github.com/jknecht/baseball-archive-sqlite). We also show how to use a SQL database inside Rmarkdown.

First, we create a connection to the database. In this case, we are using a SQLite database. A good system to prototype database designs. To make the most of a database system, one would use some of the more powerful products: Oracle, Microsoft SQLServer, MySQL (MariaDB), PostgreSQL or other. In all cases, the way to create a connection to the server from Rmarkdown is the same.

library(DBI)
db <- dbConnect(RSQLite::SQLite(), dbname="../lecture-notes/data/lahman2016.sqlite")

Select-From-Where

First, we write a query to get batting statistics for Washington Nationals in 2010: note the rename. Note that the chunk in the markdown file is declared as {sql, connection=db} in the Rmarkdown file:

SELECT b.playerId, b.yearId, b.H, b.AB
FROM BATTING AS b
WHERE teamID = 'WAS' AND yearID = 2010

Displaying records 1 - 10
playerID	yearID	H	AB
atilalu01	2010	1	25
balesco01	2010	0	0
batismi01	2010	1	8
bergmja01	2010	0	0
bernaro01	2010	102	414
bisenjo01	2010	0	0
brunebr01	2010	0	0
burkeja02	2010	0	0
burnese01	2010	0	0
cappsma01	2010	0	1

Expressions

The select clause can contain expressions (this is paralleled by the mutate operation we saw previously)

SELECT b.playerId, b.yearId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b

Displaying records 1 - 10
playerID	yearID	AB	BP
aardsda01	2004	0	NA
aardsda01	2006	2	0.0000000
aardsda01	2007	0	NA
aardsda01	2008	1	0.0000000
aardsda01	2009	0	NA
aardsda01	2010	0	NA
aardsda01	2012	0	NA
aardsda01	2013	0	NA
aardsda01	2015	1	0.0000000
aaronha01	1954	468	0.2799145

WHERE predicates

The where clause support a large number of different predicates and combinations thereof (this is parallel to the filter operation)

SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
WHERE b.AB > 0 AND
  b.yearID > 2000 AND
  b.yearID < 2010 AND 
  b.teamID LIKE 'NY%'

Displaying records 1 - 10
playerID	yearID	teamID	AB	BP
abreubo01	2006	NYA	209	0.3301435
abreubo01	2007	NYA	605	0.2826446
abreubo01	2008	NYA	609	0.2955665
aceveal01	2009	NYA	2	0.0000000
agbaybe01	2001	NYN	296	0.2770270
aguilch01	2008	NYN	12	0.1666667
alfoned01	2001	NYN	457	0.2428884
alfoned01	2002	NYN	490	0.3081633
almoned01	2003	NYN	1	0.0000000
almoner01	2001	NYA	4	0.5000000

ORDERING

We can include ordering (parallel to arrange)

SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP
FROM BATTING AS b
WHERE b.AB > 0 AND
  b.yearID > 2000 AND
  b.yearID < 2010 AND 
  b.teamID LIKE 'NY%'
ORDER BY b.AB DESC, BP DESC;

Displaying records 1 - 10
playerID	yearID	teamID	AB	BP
soriaal01	2002	NYA	696	0.3002874
reyesjo01	2005	NYN	696	0.2729885
reyesjo01	2008	NYN	688	0.2965116
soriaal01	2003	NYA	682	0.2903226
reyesjo01	2007	NYN	681	0.2804699
jeterde01	2005	NYA	654	0.3088685
reyesjo01	2006	NYN	647	0.2998454
jeterde01	2002	NYA	644	0.2965839
jeterde01	2004	NYA	643	0.2923795
jeterde01	2007	NYA	639	0.3223787

Group_by and Summarize

What it does: Partition the tuples by the group attributes (teamId and yearId in this case), and do something (compute avg in this case) for each group
Number of resulting tuples == Number of groups

SELECT b.teamId, b.yearId, avg(1.0 * b.H / b.AB) AS AVE_BP
FROM BATTING AS b
WHERE b.AB > 0 AND
  b.yearID > 2000 AND
  b.yearID < 2010
GROUP BY b.teamId, b.yearId
ORDER BY AVE_BP DESC

Displaying records 1 - 10
teamID	yearID	AVE_BP
TBA	2007	0.2892067
MIN	2008	0.2875753
SEA	2007	0.2845483
SLN	2003	0.2832984
MIN	2001	0.2791984
SEA	2002	0.2738639
CLE	2006	0.2728260
TOR	2003	0.2648080
KCA	2008	0.2633696
CHA	2004	0.2603735

Batting %>%
  filter(AB > 0, ...) %>%
  group_by(teamId, yearId) %>%
  summarize(ave_bp = mean(1.0 * H / AB))

Subqueries

Sometimes it’s easier to nest queries like the one above into query and subquery

SELECT teamID, yearID, avg(BP) AS AVG_BP
FROM (SELECT b.teamId, b.yearId, 1.0 * b.H / b.AB AS BP
      FROM BATTING AS b
      WHERE b.AB > 0 AND
        b.yearID > 2000 AND
        b.yearID < 2010)
GROUP BY teamID, yearID
ORDER BY AVG_BP DESC;

Displaying records 1 - 10
teamID	yearID	AVG_BP
TBA	2007	0.2892067
MIN	2008	0.2875753
SEA	2007	0.2845483
SLN	2003	0.2832984
MIN	2001	0.2791984
SEA	2002	0.2738639
CLE	2006	0.2728260
TOR	2003	0.2648080
KCA	2008	0.2633696
CHA	2004	0.2603735

Joins

List all players from California, playing in 2015

library(Lahman)
library(tidyverse)

## ── Attaching packages ───────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0

## ── Conflicts ──────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Batting %>%
  inner_join(Master, by="playerID") %>%
  filter(yearID == "2015", birthState == "CA")

select b.playerId, b.teamId, m.birthState
from Batting as b join master as m on b.playerId == m.playerId
where yearId = "2015" and m.birthState = "CA"

Displaying records 1 - 10
playerID	teamID	birthState
alexasc01	KCA	CA
anderbr05	OAK	CA
anderco01	CLE	CA
andrima01	TBA	CA
arenano01	COL	CA
axelrdy01	CIN	CA
balesco01	CIN	CA
bandyje01	LAA	CA
barneau01	LAN	CA
barnebr02	COL	CA

Finally, we close the connection to the database:

dbDisconnect(db)