{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SQL Examples with Baseball\n",
"\n",
"Here we show SQL constructs using the Lahman baseball dataset (downloaded from https://github.com/jknecht/baseball-archive-sqlite). We also show how to use a SQL database inside a Jupyter notebook.\n",
"\n",
"First, we create a connection to the database. In this case, we are using a `SQLite` database. A good system to prototype database designs. To make the most of a database system, one would use some of the more powerful products: Oracle, Microsoft SQLServer, MySQL (MariaDB), PostgreSQL or other. In all cases, the way to create a connection to the server from Rmarkdown is the same."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sqlite3"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"con = sqlite3.connect(r'data/lahman2016.sqlite')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Select-From-Where\n",
"\n",
"First, we write a query to get batting statistics for Washington Nationals in 2010: note the table *rename*. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" playerID | \n",
" yearID | \n",
" H | \n",
" AB | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" atilalu01 | \n",
" 2010 | \n",
" 1 | \n",
" 25 | \n",
"
\n",
" \n",
" 1 | \n",
" balesco01 | \n",
" 2010 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" batismi01 | \n",
" 2010 | \n",
" 1 | \n",
" 8 | \n",
"
\n",
" \n",
" 3 | \n",
" bergmja01 | \n",
" 2010 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" bernaro01 | \n",
" 2010 | \n",
" 102 | \n",
" 414 | \n",
"
\n",
" \n",
" 5 | \n",
" bisenjo01 | \n",
" 2010 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 6 | \n",
" brunebr01 | \n",
" 2010 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 7 | \n",
" burkeja02 | \n",
" 2010 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 8 | \n",
" burnese01 | \n",
" 2010 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 9 | \n",
" cappsma01 | \n",
" 2010 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" playerID yearID H AB\n",
"0 atilalu01 2010 1 25\n",
"1 balesco01 2010 0 0\n",
"2 batismi01 2010 1 8\n",
"3 bergmja01 2010 0 0\n",
"4 bernaro01 2010 102 414\n",
"5 bisenjo01 2010 0 0\n",
"6 brunebr01 2010 0 0\n",
"7 burkeja02 2010 0 0\n",
"8 burnese01 2010 0 0\n",
"9 cappsma01 2010 0 1"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT b.playerId, b.yearId, b.H, b.AB\n",
"FROM BATTING AS b\n",
"WHERE teamID = 'WAS' AND yearID = 2010\n",
"\"\"\"\n",
"df = pd.read_sql_query(query, con)\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Expressions\n",
"\n",
"The **select** clause can contain expressions (this is paralleled by the `assign` operation we saw previously)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" playerID | \n",
" yearID | \n",
" AB | \n",
" BP | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" aardsda01 | \n",
" 2004 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" aardsda01 | \n",
" 2006 | \n",
" 2 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" aardsda01 | \n",
" 2007 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" aardsda01 | \n",
" 2008 | \n",
" 1 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" aardsda01 | \n",
" 2009 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" playerID yearID AB BP\n",
"0 aardsda01 2004 0 NaN\n",
"1 aardsda01 2006 2 0.0\n",
"2 aardsda01 2007 0 NaN\n",
"3 aardsda01 2008 1 0.0\n",
"4 aardsda01 2009 0 NaN"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT b.playerId, b.yearId, b.AB, 1.0 * b.H / b.AB AS BP\n",
"FROM BATTING AS b\n",
"\"\"\"\n",
"df = pd.read_sql_query(query, con)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### WHERE predicates\n",
"\n",
"The **where** clause support a large number of different predicates and combinations thereof (this is parallel to the `query` operation)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" playerID | \n",
" yearID | \n",
" teamID | \n",
" AB | \n",
" BP | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" abreubo01 | \n",
" 2006 | \n",
" NYA | \n",
" 209 | \n",
" 0.330144 | \n",
"
\n",
" \n",
" 1 | \n",
" abreubo01 | \n",
" 2007 | \n",
" NYA | \n",
" 605 | \n",
" 0.282645 | \n",
"
\n",
" \n",
" 2 | \n",
" abreubo01 | \n",
" 2008 | \n",
" NYA | \n",
" 609 | \n",
" 0.295567 | \n",
"
\n",
" \n",
" 3 | \n",
" aceveal01 | \n",
" 2009 | \n",
" NYA | \n",
" 2 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 4 | \n",
" agbaybe01 | \n",
" 2001 | \n",
" NYN | \n",
" 296 | \n",
" 0.277027 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" playerID yearID teamID AB BP\n",
"0 abreubo01 2006 NYA 209 0.330144\n",
"1 abreubo01 2007 NYA 605 0.282645\n",
"2 abreubo01 2008 NYA 609 0.295567\n",
"3 aceveal01 2009 NYA 2 0.000000\n",
"4 agbaybe01 2001 NYN 296 0.277027"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP\n",
"FROM BATTING AS b\n",
"WHERE b.AB > 0 AND\n",
" b.yearID > 2000 AND\n",
" b.yearID < 2010 AND \n",
" b.teamID LIKE 'NY%'\n",
"\"\"\"\n",
"df = pd.read_sql_query(query, con)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ORDERING\n",
"\n",
"We can include ordering (parallel to `sort_values`)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" playerID | \n",
" yearID | \n",
" teamID | \n",
" AB | \n",
" BP | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" soriaal01 | \n",
" 2002 | \n",
" NYA | \n",
" 696 | \n",
" 0.300287 | \n",
"
\n",
" \n",
" 1 | \n",
" reyesjo01 | \n",
" 2005 | \n",
" NYN | \n",
" 696 | \n",
" 0.272989 | \n",
"
\n",
" \n",
" 2 | \n",
" reyesjo01 | \n",
" 2008 | \n",
" NYN | \n",
" 688 | \n",
" 0.296512 | \n",
"
\n",
" \n",
" 3 | \n",
" soriaal01 | \n",
" 2003 | \n",
" NYA | \n",
" 682 | \n",
" 0.290323 | \n",
"
\n",
" \n",
" 4 | \n",
" reyesjo01 | \n",
" 2007 | \n",
" NYN | \n",
" 681 | \n",
" 0.280470 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" playerID yearID teamID AB BP\n",
"0 soriaal01 2002 NYA 696 0.300287\n",
"1 reyesjo01 2005 NYN 696 0.272989\n",
"2 reyesjo01 2008 NYN 688 0.296512\n",
"3 soriaal01 2003 NYA 682 0.290323\n",
"4 reyesjo01 2007 NYN 681 0.280470"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP\n",
"FROM BATTING AS b\n",
"WHERE b.AB > 0 AND\n",
" b.yearID > 2000 AND\n",
" b.yearID < 2010 AND \n",
" b.teamID LIKE 'NY%'\n",
"ORDER BY b.AB DESC, BP DESC\n",
"\"\"\"\n",
"df = pd.read_sql_query(query, con)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Group_by and Summarize\n",
"\n",
"- What it does: Partition the tuples by the group attributes (*teamId* and *yearId* in this case), and do something (*compute avg* in this case) for each group\n",
"- Number of resulting tuples == Number of groups"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" teamID | \n",
" yearID | \n",
" AVE_BP | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TBA | \n",
" 2007 | \n",
" 0.289207 | \n",
"
\n",
" \n",
" 1 | \n",
" MIN | \n",
" 2008 | \n",
" 0.287575 | \n",
"
\n",
" \n",
" 2 | \n",
" SEA | \n",
" 2007 | \n",
" 0.284548 | \n",
"
\n",
" \n",
" 3 | \n",
" SLN | \n",
" 2003 | \n",
" 0.283298 | \n",
"
\n",
" \n",
" 4 | \n",
" MIN | \n",
" 2001 | \n",
" 0.279198 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" teamID yearID AVE_BP\n",
"0 TBA 2007 0.289207\n",
"1 MIN 2008 0.287575\n",
"2 SEA 2007 0.284548\n",
"3 SLN 2003 0.283298\n",
"4 MIN 2001 0.279198"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT b.teamId, b.yearId, avg(1.0 * b.H / b.AB) AS AVE_BP\n",
"FROM BATTING AS b\n",
"WHERE b.AB > 0 AND\n",
" b.yearID > 2000 AND\n",
" b.yearID < 2010\n",
"GROUP BY b.teamId, b.yearId\n",
"ORDER BY AVE_BP DESC\n",
"\"\"\"\n",
"df = pd.read_sql_query(query, con)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, this is how we would do this using \n",
"the `pandas` operations we learned about previously."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"Batting\n",
" .assign(BP = 1.0 * Batting.H / Batting.AB)\n",
" .query('AB > 0 & ...')\n",
" .groupby(['teamId', 'yearId'])\n",
" .agg({'AB': ['mean']})\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subqueries\n",
"\n",
"Sometimes it's easier to nest queries like the one above into query and subquery"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" teamId | \n",
" yearId | \n",
" AVG_BP | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" TBA | \n",
" 2007 | \n",
" 0.289207 | \n",
"
\n",
" \n",
" 1 | \n",
" MIN | \n",
" 2008 | \n",
" 0.287575 | \n",
"
\n",
" \n",
" 2 | \n",
" SEA | \n",
" 2007 | \n",
" 0.284548 | \n",
"
\n",
" \n",
" 3 | \n",
" SLN | \n",
" 2003 | \n",
" 0.283298 | \n",
"
\n",
" \n",
" 4 | \n",
" MIN | \n",
" 2001 | \n",
" 0.279198 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" teamId yearId AVG_BP\n",
"0 TBA 2007 0.289207\n",
"1 MIN 2008 0.287575\n",
"2 SEA 2007 0.284548\n",
"3 SLN 2003 0.283298\n",
"4 MIN 2001 0.279198"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT teamID, yearID, avg(BP) AS AVG_BP\n",
"FROM (SELECT b.teamId, b.yearId, 1.0 * b.H / b.AB AS BP\n",
" FROM BATTING AS b\n",
" WHERE b.AB > 0 AND\n",
" b.yearID > 2000 AND\n",
" b.yearID < 2010)\n",
"GROUP BY teamID, yearID\n",
"ORDER BY AVG_BP DESC;\n",
"\"\"\"\n",
"\n",
"df = pd.read_sql_query(query, con)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Joins\n",
"\n",
"List all players from California, playing in 2015"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" playerID | \n",
" teamID | \n",
" birthState | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" alexasc01 | \n",
" KCA | \n",
" CA | \n",
"
\n",
" \n",
" 1 | \n",
" anderbr05 | \n",
" OAK | \n",
" CA | \n",
"
\n",
" \n",
" 2 | \n",
" anderco01 | \n",
" CLE | \n",
" CA | \n",
"
\n",
" \n",
" 3 | \n",
" andrima01 | \n",
" TBA | \n",
" CA | \n",
"
\n",
" \n",
" 4 | \n",
" arenano01 | \n",
" COL | \n",
" CA | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" playerID teamID birthState\n",
"0 alexasc01 KCA CA\n",
"1 anderbr05 OAK CA\n",
"2 anderco01 CLE CA\n",
"3 andrima01 TBA CA\n",
"4 arenano01 COL CA"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT b.playerId, b.teamId, m.birthState\n",
"FROM Batting as b JOIN master as m on b.playerId == m.playerId\n",
"WHERE yearId = \"2015\" and m.birthState = \"CA\"\n",
"\"\"\" \n",
"\n",
"df = pd.read_sql_query(query, con)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we close the connection to the database:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"con.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}