CMSC320 Intro. Data Science
Hector Corrada Bravo
1
Preamble
2
Introduction and Overview
2.1
What is Data Science?
2.1.1
Data
2.1.2
Specific Questions
2.1.3
Interdisciplinary Activities
2.1.4
Data-Centric Artifacts and Applications
2.2
Why Data Science?
2.3
Data Science in Society
2.4
Course Organization
2.5
General Workflow
2.5.1
Defining the Goal
2.5.2
Data Collection and Management
2.5.3
Modeling
2.5.4
Model Evaluation
2.5.5
Presentation
2.5.6
Deployment
3
An Illustrative Analysis
3.1
Gathering data
3.1.1
Movie ratings
3.1.2
Movie budgets and revenue
3.2
Manipulating the data
3.3
Visualizing the data
3.4
Modeling data
3.5
Visualizing model results
3.6
Abstracting the analysis
3.7
Making analyses accessible
3.8
Summary
4
Setting up the Data Science Toolbox
4.1
R/Rstudio
4.1.1
Some history
4.1.2
Setting up R
4.1.3
Setting up Rstudio
4.1.4
A first look at Rstudio
4.1.5
Interactive Console
4.1.6
Data Viewer
4.1.7
Names, values and functions
4.1.8
Plotting
4.1.9
Editor
4.1.10
Files viewer
4.1.11
R packages
4.1.12
Additional R resources
4.1.13
Literate Programming
4.1.14
Course packages
4.2
Python/Jupyter
(Part) Data representation modeling, ingestion and cleaning
5
Measurements and Data Types
5.1
A data analysis to get us going
5.2
Getting data
5.2.1
Names, values and functions
5.3
Entities and attributes
5.4
Categorical attributes
5.4.1
Factors in R
5.5
Discrete numeric attributes
5.6
Continuous numeric data
5.7
Other examples
5.8
Other important datatypes
5.9
Units
5.10
Quick questions
6
Principles: Basic Operations
6.1
Operations that subset attributes
6.1.1
select
6.1.2
rename
6.2
Operations that subset entities
6.2.1
slice
6.2.2
filter
6.2.3
sample_n and sample_frac
6.3
Pipelines of operations
7
Principles: More Operations
7.1
Operations that sort entities
7.2
Operations that create new attributes
7.3
Operations that summarize attribute values over entities
7.4
Operations that group entities
7.5
Vectors
7.6
Attributes as vectors
7.7
Functions
8
Basic plotting with ggplot
8.1
Plot Construction Details
8.1.1
Mappings
8.1.2
Representations
8.2
Frequently Used Plots
8.2.1
Scatter plot
8.2.2
Bar graph
8.2.3
Histogram
8.2.4
Boxplot
9
Brief Introduction to Rmarkdown
10
Best Practices for Data Science Projects
11
Tidy Data I: The ER Model
11.1
Overview
11.2
The Entity-Relationship and Relational Models
11.2.1
Formal introduction to keys
11.3
Tidy Data
12
SQL I: Single Table Queries
12.1
Group-by and summarize
12.2
Subqueries
13
Two-table operations
13.1
Left Join
13.2
Right Join
13.3
Inner Join
13.4
Full Join
13.5
Join conditions
13.6
Filtering Joins
13.7
SQL Constructs: Multi-table Queries
14
SQL System Constructs
14.1
SQL as a Data Definition Language
14.2
Set Operations and Comparisons
14.3
Views
14.4
NULLs
15
DB Parting Shots
15.1
Database Query Optimization
15.2
JSON Data Model
16
Ingesting data
16.1
Structured ingestion
16.1.1
CSV files (and similar)
16.1.2
Excel spreadsheets
16.2
Scraping
16.2.1
Scraping from dirty HTML tables
17
Tidying data
17.1
Tidy Data
17.2
Common problems in messy data
17.2.1
Headers as values
17.2.2
Multiple variables in one column
17.2.3
Variables stored in both rows and columns
17.2.4
Multiple types in one table
18
Text and Dates
18.1
Text
18.1.1
String operations
18.1.2
Regular expressions
18.1.3
Tools using regular expressions
18.1.4
Extracting attributes from text
18.2
Handling dates
19
Entity Resolution and Record Linkage
19.1
Problem Definition
19.2
One approach: similarity function
19.2.1
Example attribute functions
19.3
Solving the resolution problem
19.3.1
Many-to-one resolutions
19.3.2
One-to-one resolutions
19.3.3
Other constraints
19.4
Discussion
(Part) Exploratory Data Analysis
20
Exploratory Data Analysis: Visualization
20.0.1
EDA (Exploratory Data Analysis)
20.1
Visualization of single variables
20.1.1
Visualization of pairs of variables
20.2
EDA with the grammar of graphics
20.2.1
Other aesthetics
20.2.2
Faceting
21
Exploratory Data Analysis: Summary Statistics
21.1
Range
21.2
Central Tendency
21.2.1
Derivation of the mean as central tendency statistic
21.3
Spread
21.3.1
Variance
21.3.2
Spread estimates using rank statistics
21.4
Outliers
21.5
Skew
21.6
Covariance and correlation
21.7
Postscript: Finding Maxima/Minima using Derivatives
21.7.1
Steps to find Maxima/Minima of function \(f(x)\)
21.7.2
Notes on Finding Derivatives
21.7.3
Resources
22
EDA: Data Transformations
22.1
Centering and scaling
22.2
Treating categorical variables as numeric
22.2.1
Discretizing continuous values
22.3
Skewed Data
23
EDA: Handling Missing Data
23.1
Mechanisms of missing data
23.2
Handling missing data
23.2.1
Removing missing data
23.2.2
Encoding as missing
23.2.3
Imputation
23.3
Implications of imputation
(Part) Statistical Learning
24
Univariate distributions and statistics
24.1
Variation, randomness and stochasticity
24.1.1
Random variables
24.2
(Discrete) Probability distributions
24.2.1
Example: The oracle of TWEET
24.3
Expectation
24.4
Estimation
24.4.1
Law of large numbers (LLN)
24.4.2
Central Limit Theorem (CLT)
24.5
The normal distribution
24.5.1
CLT continued
24.6
The Bootstrap Procedure
25
Experiment design and hypothesis testing
25.1
Inference
25.1.1
Hypothesis testing
25.2
A/B Testing
25.3
Summary
25.4
Probability Distributions
25.4.1
Bernoulli
25.4.2
Binomial
25.4.3
Normal (Gaussian) distribution
25.4.4
Distributions in R
26
Multivariate probability
26.1
Joint and conditional probability
26.2
Bayes’ Rule
26.3
Conditional expectation
26.4
Maximum likelihood
(Part) Machine Learning
27
Data Analysis with Geometry
27.1
Motivating Example: Credit Analysis
27.2
From data to feature vectors
27.3
Technical notation
27.4
Geometry and Distances
27.4.1
K-nearest neighbor classification
27.4.2
The importance of transformations
27.5
Quick vector algebra review
27.5.1
Quiz
27.6
The curse of dimensionality
27.7
Summary
28
Linear Regression
28.1
Simple Regression
28.2
Inference
28.2.1
Confidence Interval
28.2.2
The \(t\)-statistic and the \(t\)-distribution
28.2.3
Global Fit
28.3
Some important technicalities
28.4
Issues with linear regression
28.4.1
Non-linearity of outcome-predictor relationship
28.4.2
Correlated Error
28.4.3
Non-constant variance
28.5
Multiple linear regression
28.5.1
Estimation in multivariate regression
28.5.2
Example (cont’d)
28.5.3
Statistical statements (cont’d)
28.5.4
The F-test
28.5.5
Categorical predictors (cont’d)
28.6
Interactions in linear models
28.6.1
Additional issues with linear regression
29
Linear models for classification
29.1
An example classification problem
29.2
Why not linear regression?
29.3
Classification as probability estimation problem
29.4
Logistic regression
29.4.1
Exercises
29.4.2
Making predictions
29.4.3
Multiple logistic regression
29.4.4
Exercise
29.5
Linear Discriminant Analysis
29.5.1
How to train LDA
29.6
Summary
30
Solving linear ML problems
30.1
Case Study
30.2
Gradient Descent
30.2.1
Logistic Regression
30.3
Stochastic gradient descent
30.4
Parallelizing gradient descent
31
Tree-Based Methods
31.1
Regression Trees
31.2
Classification (Decision) Trees
31.3
Specifics of the partitioning algorithm
31.3.1
The predictor space
31.3.2
Learning Strategy
31.3.3
Tree Growing
31.3.4
Deviance as a measure of impurity
31.3.5
Other measures of impurity
31.3.6
Tree Pruning
31.4
Properties of Tree Methods
31.5
Random Forests
31.6
Tree-based methods summary
32
Model Selection and Evaluation
32.1
Classifier evaluation
32.2
Model selection
32.2.1
Cross Validation
32.2.2
Validation Set
32.2.3
Resampled validation set
32.2.4
Leave-one-out Cross-Validation
32.2.5
k-fold Cross-Validation
32.2.6
Cross-Validation in Classification
32.2.7
Comparing models statistically using cross-validation
32.3
Summary
33
Unsupervised Learning: Clustering
33.1
Motivating Example
33.2
Some Preliminaries
33.3
Cluster Analysis
33.4
Dissimilarity-based Clustering
33.5
K-means Clustering
33.6
Choosing the number of clusters
33.7
Summary
34
Unsupervised Learning: Dimensionality Reduction
34.1
Principal Component Analysis
34.1.1
Solving the PCA
34.2
Multidimensional Scaling
34.3
Summary
Lecture Notes: Introduction to Data Science
(Part) Exploratory Data Analysis