{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Examples with Baseball\n", "\n", "Here we show SQL constructs using the Lahman baseball dataset (downloaded from https://github.com/jknecht/baseball-archive-sqlite). We also show how to use a SQL database inside a Jupyter notebook.\n", "\n", "First, we create a connection to the database. In this case, we are using a `SQLite` database. A good system to prototype database designs. To make the most of a database system, one would use some of the more powerful products: Oracle, Microsoft SQLServer, MySQL (MariaDB), PostgreSQL or other. In all cases, the way to create a connection to the server from Rmarkdown is the same." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import sqlite3" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "con = sqlite3.connect(r'data/lahman2016.sqlite')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select-From-Where\n", "\n", "First, we write a query to get batting statistics for Washington Nationals in 2010: note the table *rename*. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerIDyearIDHAB
0atilalu012010125
1balesco01201000
2batismi01201018
3bergmja01201000
4bernaro012010102414
5bisenjo01201000
6brunebr01201000
7burkeja02201000
8burnese01201000
9cappsma01201001
\n", "
" ], "text/plain": [ " playerID yearID H AB\n", "0 atilalu01 2010 1 25\n", "1 balesco01 2010 0 0\n", "2 batismi01 2010 1 8\n", "3 bergmja01 2010 0 0\n", "4 bernaro01 2010 102 414\n", "5 bisenjo01 2010 0 0\n", "6 brunebr01 2010 0 0\n", "7 burkeja02 2010 0 0\n", "8 burnese01 2010 0 0\n", "9 cappsma01 2010 0 1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT b.playerId, b.yearId, b.H, b.AB\n", "FROM BATTING AS b\n", "WHERE teamID = 'WAS' AND yearID = 2010\n", "\"\"\"\n", "df = pd.read_sql_query(query, con)\n", "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expressions\n", "\n", "The **select** clause can contain expressions (this is paralleled by the `assign` operation we saw previously)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerIDyearIDABBP
0aardsda0120040NaN
1aardsda01200620.0
2aardsda0120070NaN
3aardsda01200810.0
4aardsda0120090NaN
\n", "
" ], "text/plain": [ " playerID yearID AB BP\n", "0 aardsda01 2004 0 NaN\n", "1 aardsda01 2006 2 0.0\n", "2 aardsda01 2007 0 NaN\n", "3 aardsda01 2008 1 0.0\n", "4 aardsda01 2009 0 NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT b.playerId, b.yearId, b.AB, 1.0 * b.H / b.AB AS BP\n", "FROM BATTING AS b\n", "\"\"\"\n", "df = pd.read_sql_query(query, con)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### WHERE predicates\n", "\n", "The **where** clause support a large number of different predicates and combinations thereof (this is parallel to the `query` operation)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerIDyearIDteamIDABBP
0abreubo012006NYA2090.330144
1abreubo012007NYA6050.282645
2abreubo012008NYA6090.295567
3aceveal012009NYA20.000000
4agbaybe012001NYN2960.277027
\n", "
" ], "text/plain": [ " playerID yearID teamID AB BP\n", "0 abreubo01 2006 NYA 209 0.330144\n", "1 abreubo01 2007 NYA 605 0.282645\n", "2 abreubo01 2008 NYA 609 0.295567\n", "3 aceveal01 2009 NYA 2 0.000000\n", "4 agbaybe01 2001 NYN 296 0.277027" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP\n", "FROM BATTING AS b\n", "WHERE b.AB > 0 AND\n", " b.yearID > 2000 AND\n", " b.yearID < 2010 AND \n", " b.teamID LIKE 'NY%'\n", "\"\"\"\n", "df = pd.read_sql_query(query, con)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ORDERING\n", "\n", "We can include ordering (parallel to `sort_values`)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerIDyearIDteamIDABBP
0soriaal012002NYA6960.300287
1reyesjo012005NYN6960.272989
2reyesjo012008NYN6880.296512
3soriaal012003NYA6820.290323
4reyesjo012007NYN6810.280470
\n", "
" ], "text/plain": [ " playerID yearID teamID AB BP\n", "0 soriaal01 2002 NYA 696 0.300287\n", "1 reyesjo01 2005 NYN 696 0.272989\n", "2 reyesjo01 2008 NYN 688 0.296512\n", "3 soriaal01 2003 NYA 682 0.290323\n", "4 reyesjo01 2007 NYN 681 0.280470" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT b.playerId, b.yearID, b. teamId, b.AB, 1.0 * b.H / b.AB AS BP\n", "FROM BATTING AS b\n", "WHERE b.AB > 0 AND\n", " b.yearID > 2000 AND\n", " b.yearID < 2010 AND \n", " b.teamID LIKE 'NY%'\n", "ORDER BY b.AB DESC, BP DESC\n", "\"\"\"\n", "df = pd.read_sql_query(query, con)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Group_by and Summarize\n", "\n", "- What it does: Partition the tuples by the group attributes (*teamId* and *yearId* in this case), and do something (*compute avg* in this case) for each group\n", "- Number of resulting tuples == Number of groups" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
teamIDyearIDAVE_BP
0TBA20070.289207
1MIN20080.287575
2SEA20070.284548
3SLN20030.283298
4MIN20010.279198
\n", "
" ], "text/plain": [ " teamID yearID AVE_BP\n", "0 TBA 2007 0.289207\n", "1 MIN 2008 0.287575\n", "2 SEA 2007 0.284548\n", "3 SLN 2003 0.283298\n", "4 MIN 2001 0.279198" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT b.teamId, b.yearId, avg(1.0 * b.H / b.AB) AS AVE_BP\n", "FROM BATTING AS b\n", "WHERE b.AB > 0 AND\n", " b.yearID > 2000 AND\n", " b.yearID < 2010\n", "GROUP BY b.teamId, b.yearId\n", "ORDER BY AVE_BP DESC\n", "\"\"\"\n", "df = pd.read_sql_query(query, con)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, this is how we would do this using \n", "the `pandas` operations we learned about previously." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "Batting\n", " .assign(BP = 1.0 * Batting.H / Batting.AB)\n", " .query('AB > 0 & ...')\n", " .groupby(['teamId', 'yearId'])\n", " .agg({'AB': ['mean']})\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subqueries\n", "\n", "Sometimes it's easier to nest queries like the one above into query and subquery" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
teamIdyearIdAVG_BP
0TBA20070.289207
1MIN20080.287575
2SEA20070.284548
3SLN20030.283298
4MIN20010.279198
\n", "
" ], "text/plain": [ " teamId yearId AVG_BP\n", "0 TBA 2007 0.289207\n", "1 MIN 2008 0.287575\n", "2 SEA 2007 0.284548\n", "3 SLN 2003 0.283298\n", "4 MIN 2001 0.279198" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT teamID, yearID, avg(BP) AS AVG_BP\n", "FROM (SELECT b.teamId, b.yearId, 1.0 * b.H / b.AB AS BP\n", " FROM BATTING AS b\n", " WHERE b.AB > 0 AND\n", " b.yearID > 2000 AND\n", " b.yearID < 2010)\n", "GROUP BY teamID, yearID\n", "ORDER BY AVG_BP DESC;\n", "\"\"\"\n", "\n", "df = pd.read_sql_query(query, con)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joins\n", "\n", "List all players from California, playing in 2015" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerIDteamIDbirthState
0alexasc01KCACA
1anderbr05OAKCA
2anderco01CLECA
3andrima01TBACA
4arenano01COLCA
\n", "
" ], "text/plain": [ " playerID teamID birthState\n", "0 alexasc01 KCA CA\n", "1 anderbr05 OAK CA\n", "2 anderco01 CLE CA\n", "3 andrima01 TBA CA\n", "4 arenano01 COL CA" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT b.playerId, b.teamId, m.birthState\n", "FROM Batting as b JOIN master as m on b.playerId == m.playerId\n", "WHERE yearId = \"2015\" and m.birthState = \"CA\"\n", "\"\"\" \n", "\n", "df = pd.read_sql_query(query, con)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we close the connection to the database:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "con.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 4 }