Last Update: 2019-05-14
We will use Mortgage Affordability data from Zillow to experiment with classification algorithms. The data was downloaded from the Zillow Research page: https://www.zillow.com/research/data/
It is made available here: http://www.hcbravo.org/IntroDataSci/misc/Affordability_Wide_2017Q4_Public.csv
First, we will tidy the data. Please include this piece of code in your submission.
library(tidyverse)
library(lubridate)
theme_set(theme_bw())
csv_file <- "Affordability_Wide_2017Q4_Public.csv"
tidy_afford <- read_csv(csv_file) %>%
  filter(Index == "Mortgage Affordability") %>%        # keep only the mortgage affordability index
  drop_na() %>%                                        # remove rows with missing values
  filter(RegionID != 0, RegionName != "United States") %>%  # drop the national aggregate
  dplyr::select(RegionID, RegionName, matches("^[1|2]")) %>% # keep region info and the date columns (names starting with a year)
  gather(time, affordability, matches("^[1|2]")) %>%   # one row per region and date
  type_convert(col_types=cols(time=col_date(format="%Y-%m")))
tidy_afford
## # A tibble: 12,480 x 4
## RegionID RegionName time affordability
## <dbl> <chr> <date> <dbl>
## 1 394913 New York, NY 1979-03-01 0.262
## 2 753899 Los Angeles-Long Beach-Anaheim, CA 1979-03-01 0.358
## 3 394463 Chicago, IL 1979-03-01 0.262
## 4 394514 Dallas-Fort Worth, TX 1979-03-01 0.301
## 5 394974 Philadelphia, PA 1979-03-01 0.204
## 6 394692 Houston, TX 1979-03-01 0.243
## 7 395209 Washington, DC 1979-03-01 0.254
## 8 394856 Miami-Fort Lauderdale, FL 1979-03-01 0.268
## 9 394347 Atlanta, GA 1979-03-01 0.248
## 10 394404 Boston, MA 1979-03-01 0.222
## # … with 12,470 more rows
This is what the data looks like:
tidy_afford %>%
  ggplot(aes(x=time, y=affordability, group=factor(RegionID))) +
  geom_line(color="GRAY", alpha=3/4, size=1/2) +
  labs(title="County-Level Mortgage Affordability over Time",
       x="Date", y="Mortgage Affordability")
The prediction task we are going to address is:

Can we predict if mortgage affordability will increase or decrease a year from now?

Specifically, we will do this for quarter 4 (Q4) of 2017. To create the outcome we will predict, we compare affordability for Q4 of 2017 to Q4 of 2016 and label it as up or down depending on the sign of this difference. Let’s create the outcome we want to predict (again, copy this bit of code to your submission):
outcome_df <- tidy_afford %>%
  mutate(yq = quarter(time, with_year=TRUE)) %>%
  filter(yq %in% c("2016.4", "2017.4")) %>%
  select(RegionID, RegionName, yq, affordability) %>%
  spread(yq, affordability) %>%
  mutate(diff = `2017.4` - `2016.4`) %>%
  mutate(Direction = ifelse(diff>0, "up", "down")) %>%  # label by the sign of the year-over-year change
  select(RegionID, RegionName, Direction)
outcome_df
## # A tibble: 80 x 3
## RegionID RegionName Direction
## <dbl> <chr> <chr>
## 1 394304 Akron, OH down
## 2 394312 Albuquerque, NM down
## 3 394318 Allentown, PA down
## 4 394347 Atlanta, GA up
## 5 394355 Austin, TX up
## 6 394357 Bakersfield, CA down
## 7 394358 Baltimore, MD down
## 8 394367 Baton Rouge, LA up
## 9 394378 Bellingham, WA up
## 10 394388 Birmingham, AL down
## # … with 70 more rows
Now you have a data frame with outcomes (labels) for each county in the dataset.
The goal is then, given predictors \(X_i\) for county \(i\), to build a classifier for the outcome \(G_i \in \{\mathtt{up},\mathtt{down}\}\).
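Before building classifiers, it is worth checking how balanced the outcome labels are. This quick count (an aside, not part of the required code) uses the outcome_df created above:

# class balance of the outcome labels (aside, not part of the required code)
outcome_df %>% count(Direction)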
For your classifiers you should use data up to 2016.
predictor_df <- tidy_afford %>%
  filter(year(time) <= 2016)
Your goal for this project is to do an experiment to address a single technical question about our ability to make this prediction. There is a list of possible questions you may address below. Each of them asks you to compare two specific choices in the classification workflow (e.g., two classification algorithms, two feature representations, etc.). You will implement each of the two choices and use 10-fold cross-validation (across RegionIDs) to compare their relative performance. You will also create ROC curves to compare them.
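As a rough illustration of that setup, here is a minimal sketch (not part of the required code; the worked example later in this document uses 5-fold CV instead) of how one could build a stratified 10-fold partition of regions with caret::createFolds and pass it to caret's trainControl. The object names cv_partition_10 and ten_fold_control are placeholders:

# minimal sketch of a 10-fold CV setup (object names are placeholders)
library(caret)
set.seed(1234)

# stratified 10-fold partition of regions, based on the outcome labels
cv_partition_10 <- createFolds(factor(outcome_df$Direction), k=10)

# tell caret to use these folds, report class probabilities, and
# compute sensitivity/specificity (needed for ROC curves)
ten_fold_control <- trainControl(method = "cv",
                                 number = 10,
                                 indexOut = cv_partition_10,
                                 summaryFunction = twoClassSummary,
                                 classProbs = TRUE,
                                 savePredictions = TRUE)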
Note that you still have to make some choices regardless of the question you choose. For example, to do the feature preprocessing and representation experiments you have to choose a classifier (random forest for example), and decide what to do about hyper-parameters if appropriate.
Prepare an R Markdown file which includes:
As an example, let’s build a Random Forest classifier using quarterly differences, after data standardization, for years 2014-2016. First, filter to the years of interest and standardize affordability within each region:
standardized_df <- predictor_df %>%
  filter(year(time) %in% 2014:2016) %>%
  group_by(RegionID) %>%
  mutate(mean_aff = mean(affordability)) %>%
  mutate(sd_aff = sd(affordability)) %>%
  mutate(z_aff = (affordability - mean_aff) / sd_aff) %>%
  ungroup()
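As a quick sanity check (an aside, not in the original example), each region's standardized affordability should now have mean roughly 0 and standard deviation roughly 1:

# sanity check: per-region z-scores should have mean ~0 and sd ~1 (aside)
standardized_df %>%
  group_by(RegionID) %>%
  summarize(mean_z = mean(z_aff), sd_z = sd(z_aff)) %>%
  summary()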
To train our model we need a table with one row per region and attributes corresponding to differences in quarterly affordability. We will do this in stages: first we turn the tidy dataset into a wide dataset using tidyr::spread, then we create a data frame containing the differences we use as features.
wide_df <- standardized_df %>%
  select(RegionID, time, z_aff) %>%
  tidyr::spread(time, z_aff)
Now, we turn this into quarterly differences by subtracting each quarter's standardized affordability from the following quarter's:
# all quarters except the first (columns 2..n)
matrix_1 <- wide_df %>%
  select(-RegionID) %>%
  as.matrix() %>%
  .[,-1]

# all quarters except the last (columns 1..n-1)
matrix_2 <- wide_df %>%
  select(-RegionID) %>%
  as.matrix() %>%
  .[,-ncol(.)]

# consecutive-quarter differences, one row per region
diff_df <- (matrix_1 - matrix_2) %>%
  magrittr::set_colnames(NULL) %>%
  as_data_frame() %>%
  mutate(RegionID = wide_df$RegionID)
Finally, add the outcome we want to predict from the data frame we created previously.
final_df <- diff_df %>%
  inner_join(outcome_df %>% select(RegionID, Direction), by="RegionID") %>%
  mutate(Direction=factor(Direction, levels=c("down", "up")))
final_df
## # A tibble: 80 x 13
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.482 -0.224 -1.13 -0.113 1.22 0.0828 0.270 -0.869 -0.187
## 2 -0.656 -0.0920 -1.01 -0.153 0.757 -0.0590 -0.138 -0.908 -0.449
## 3 -0.423 -0.251 -1.20 -0.387 0.646 -0.0271 -0.110 -0.718 -0.498
## 4 -0.308 0.0988 -0.938 0.179 1.26 0.0681 0.392 -0.874 -0.361
## 5 -0.371 0.0929 -1.03 0.150 1.60 -0.107 0.307 -0.541 -0.384
## 6 -0.496 0.239 -0.738 0.362 1.04 -0.0637 0.465 -0.607 -0.0931
## 7 -0.472 -0.118 -1.05 -0.458 0.509 -0.285 0.116 -0.844 -0.327
## 8 -0.485 0.261 -0.661 -0.0810 0.921 0.0145 0.427 -0.213 0.231
## 9 -0.438 0.527 -1.11 0.0971 0.720 0.191 0.186 -0.219 0.196
## 10 -0.919 -0.480 -1.01 -0.170 0.814 -0.407 0.0552 -0.684 -0.441
## # … with 70 more rows, and 4 more variables: V10 <dbl>, V11 <dbl>,
## # RegionID <dbl>, Direction <fct>
For illustration, split the data set into training regions and testing regions (using a 50/50 random split).
set.seed(1234)

# sample half the regions within each outcome class for the test set
test_random_forest_df <- final_df %>%
  group_by(Direction) %>%
  sample_frac(.5) %>%
  ungroup()

# the remaining regions form the training set
train_random_forest_df <- final_df %>%
  anti_join(test_random_forest_df, by="RegionID")
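Because the split is stratified by Direction, both sets should contain roughly the same proportion of up and down regions; a quick check (an aside):

# label counts in the test and training sets (aside)
test_random_forest_df %>% count(Direction)
train_random_forest_df %>% count(Direction)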
Let’s learn the random forest using default parameters.
library(caret)
rf_fit <- train(Direction~.,
                data = train_random_forest_df %>%
                  select(-RegionID),
                method="rf",
                trControl=trainControl(method="none"))
rf_fit
## Random Forest
##
## 40 samples
## 11 predictors
## 2 classes: 'down', 'up'
##
## No pre-processing
## Resampling: None
Now, let’s make predictions on the held-out test set:
test_predictions <- predict(rf_fit, newdata = test_random_forest_df %>%
                              select(-RegionID))
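As an aside (not used in the original example), caret's predict method can also return class probabilities with type="prob", which would be needed to draw an ROC curve on this held-out set:

# predicted class probabilities on the held-out set (aside, not used below)
test_probabilities <- predict(rf_fit,
                              newdata = test_random_forest_df %>% select(-RegionID),
                              type = "prob")
head(test_probabilities)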
Make a confusion matrix and calculate the error rate:
table(pred=test_predictions,
      observed=test_random_forest_df$Direction)
## observed
## pred down up
## down 6 4
## up 8 22
The error rate is 30% (12 of the 40 test regions are misclassified).
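We can also compute the error rate directly from the predictions (a small addition for completeness):

# proportion of misclassified test regions
mean(test_predictions != test_random_forest_df$Direction)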
Using this setup, we show an example experiment comparing random forests with different numbers of trees. We will use 5-fold cross-validation to compare a random forest with 500 trees to a random forest with 10 trees. Since this is a smallish dataset, I am using 5-fold cross-validation to make the validation sets have more examples and therefore more reliable performance estimates.
We use the caret::createFolds function to create a stratified 5-fold cross-validation partition. The result is a list of length 5, containing the indices of the examples used for validation in each of the cross-validation folds.
We use the train function in the caret package to organize our model training. It takes a trControl argument which sets up how to score models. Here we tell it to use 5-fold cross-validation and to measure sensitivity and specificity as performance metrics (twoClassSummary). We also tell it to savePredictions so we can make ROC plots later on. Other parameters needed by the randomForest function are passed through the train function.
set.seed(1234)

# create the cross-validation partition
cv_partition <- createFolds(final_df$Direction,
                            k=5)

fit_control <- trainControl( ## 5-fold CV
  method = "cv",
  number = 5,
  indexOut = cv_partition,
  summaryFunction=twoClassSummary,
  classProbs=TRUE,
  savePredictions=TRUE)
big_rf_fit <- train(Direction~.,
                    data=final_df,
                    method = "rf",
                    ntree = 500,
                    trControl = fit_control,
                    metric="ROC")

small_rf_fit <- train(Direction~.,
                      data = final_df,
                      method = "rf",
                      ntree = 10,
                      trControl = fit_control,
                      metric="ROC")
show(big_rf_fit)
## Random Forest
##
## 80 samples
## 12 predictors
## 2 classes: 'down', 'up'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 63, 65, 64, 64, 64
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9926667 0.9666667 0.96
## 7 0.9926667 0.9666667 0.94
## 12 0.9926667 0.8933333 0.96
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
small_rf_fit
## Random Forest
##
## 80 samples
## 12 predictors
## 2 classes: 'down', 'up'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 63, 64, 65, 64, 64
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9585758 0.8866667 0.9218182
## 7 0.9415455 0.7866667 0.9400000
## 12 0.9098788 0.8466667 0.9218182
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Now let’s compare models based on the 5-fold cross-validated area under the ROC curve (AUROC). For that purpose we use the resamples function in caret, which organizes the estimated AUROC values from each fold.
model_comparison <- resamples(list(small = small_rf_fit,
                                   big = big_rf_fit))

ggplot(model_comparison) +
  labs(title="ROC comparison", x="Model", y="ROC")
We can test for differences using linear regression.
library(broom)
comparison_df <- model_comparison %>%
  as_tibble() %>%
  gather("model", "roc", -Resample) %>%
  mutate(model=factor(model, levels=c("small","big")))

comparison_df %>%
  lm(roc~model, data=.) %>%
  tidy() %>%
  knitr::kable()
| term        |  estimate | std.error | statistic |   p.value |
|-------------|----------:|----------:|----------:|----------:|
| (Intercept) | 0.9585758 | 0.0182926 | 52.402294 | 0.0000000 |
| modelbig    | 0.0340909 | 0.0258697 |  1.317794 | 0.2240565 |
We see a small increase in average AUROC for the big model, but the difference is not statistically significant (p ≈ 0.22).
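An equivalent check (an aside, not in the original write-up) is caret's built-in paired comparison of resampled performance, which applies a paired t-test to the per-fold ROC values:

# paired comparison of the two models' per-fold ROC values (aside)
summary(diff(model_comparison, metric = "ROC"))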
Finally, let’s make ROC curves of both models.
library(plotROC)
roc_df <-
  big_rf_fit$pred %>%
  filter(mtry == 2) %>%
  mutate(model = "big") %>%
  bind_rows(small_rf_fit$pred %>%
              filter(mtry == 2) %>%
              mutate(model = "small"))

roc_df %>%
  ggplot(aes(m=up,
             d=factor(obs, levels=c("down","up")),
             color=model)) +
  geom_roc(n.cuts=0) +
  coord_equal() +
  style_roc()
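If we also want numeric AUROC values for these pooled cross-validated predictions, plotROC's calc_auc can compute them from a ggplot built with geom_roc (an aside; it rebuilds the plot object without the styling layers):

# AUROC of the pooled CV predictions for each model (aside)
roc_plot <- roc_df %>%
  ggplot(aes(m = up,
             d = factor(obs, levels = c("down", "up")),
             color = model)) +
  geom_roc(n.cuts = 0)
calc_auc(roc_plot)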