Last Update: 2018-05-05

Data

We will use Mortgage Affordability data from Zillow to experiment with classification algorithms. The data was downloaded from the Zillow Research page: https://www.zillow.com/research/data/

It is made available here: http://www.hcbravo.org/IntroDataSci/misc/Affordability_Wide_2017Q4_Public.csv

Preparing data

First, we will tidy the data. Please include this piece of code in your submission.

library(tidyverse)
library(lubridate)
theme_set(theme_bw())
csv_file <- "Affordability_Wide_2017Q4_Public.csv"
tidy_afford <- read_csv(csv_file) %>%
  # keep only the mortgage affordability index
  filter(Index == "Mortgage Affordability") %>%
  drop_na() %>%
  # drop the national aggregate, keep metro regions
  filter(RegionID != 0, RegionName != "United States") %>%
  # keep region identifiers and the quarterly columns (names start with a year)
  dplyr::select(RegionID, RegionName, matches("^[1|2]")) %>%
  # reshape to one row per region and quarter
  gather(time, affordability, matches("^[1|2]")) %>%
  # parse the quarter column names into dates
  type_convert(col_types=cols(time=col_date(format="%Y-%m")))
tidy_afford
## # A tibble: 12,480 x 4
##    RegionID RegionName                         time       affordability
##       <int> <chr>                              <date>             <dbl>
##  1   394913 New York, NY                       1979-03-01         0.262
##  2   753899 Los Angeles-Long Beach-Anaheim, CA 1979-03-01         0.358
##  3   394463 Chicago, IL                        1979-03-01         0.262
##  4   394514 Dallas-Fort Worth, TX              1979-03-01         0.301
##  5   394974 Philadelphia, PA                   1979-03-01         0.204
##  6   394692 Houston, TX                        1979-03-01         0.243
##  7   395209 Washington, DC                     1979-03-01         0.254
##  8   394856 Miami-Fort Lauderdale, FL          1979-03-01         0.268
##  9   394347 Atlanta, GA                        1979-03-01         0.248
## 10   394404 Boston, MA                         1979-03-01         0.222
## # ... with 12,470 more rows

This is what the data looks like:

tidy_afford %>%
  ggplot(aes(x=time,y=affordability,group=factor(RegionID))) +
  geom_line(color="GRAY", alpha=3/4, size=1/2) +
  labs(title="County-Level Mortgage Affordability over Time",
          x="Date", y="Mortgage Affordability")

The prediction task

The prediction question we are going to answer is:

Can we predict if mortgage affordability will increase or decrease a year from now?

Specifically, we will do this for the last observation in the dataset (quarter 4 (Q4) of 2017). To create the outcome we will predict, we will compare affordability for Q4 of 2017 to Q4 of 2016 and label it as up or down depending on the sign of this difference. Let’s create the outcome we want to predict (again, copy this bit of code to your submission):

outcome_df <- tidy_afford %>%
  # label each observation with its year and quarter (e.g., 2017.4 is Q4 2017)
  mutate(yq = quarter(time, with_year=TRUE)) %>%
  # keep only Q4 2016 and Q4 2017
  filter(yq %in% c("2016.4", "2017.4")) %>%
  select(RegionID, RegionName, yq, affordability) %>%
  # one column per quarter
  spread(yq, affordability) %>%
  # label the direction of the year-over-year change
  mutate(diff = `2017.4` - `2016.4`) %>%
  mutate(Direction = ifelse(diff>0, "up", "down")) %>%
  select(RegionID, RegionName, Direction)
outcome_df
## # A tibble: 80 x 3
##    RegionID RegionName      Direction
##       <int> <chr>           <chr>    
##  1   394304 Akron, OH       down     
##  2   394312 Albuquerque, NM down     
##  3   394318 Allentown, PA   down     
##  4   394347 Atlanta, GA     up       
##  5   394355 Austin, TX      up       
##  6   394357 Bakersfield, CA down     
##  7   394358 Baltimore, MD   down     
##  8   394367 Baton Rouge, LA up       
##  9   394378 Bellingham, WA  up       
## 10   394388 Birmingham, AL  down     
## # ... with 70 more rows

Now, you have a dataframe with outcomes (labels) for each county in the dataset.

The goal is then given predictors \(X_i\) for county \(i\), build a classifier for outcome \(G_i \in \{\mathtt{up},\mathtt{down}\}\).

For your classifiers you should use data up to and including 2016:

predictor_df <- tidy_afford %>%
  filter(year(time) <= 2016)

Your project

Your goal for this project is to do an experiment to address a (one, single) technical question about our ability to make this prediction. There is a list of possible questions you may address below. Each of them asks you to compare two specific choices in the classification workflow (e.g., two classification algorithms, two feature representations, etc.). You will implement each of the two choices and use 10-fold cross-validation (across RegionIDs) to compare their relative performance. You will also create ROC curves (and compute AUROC) to compare them.
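
As a concrete starting point, here is a minimal sketch, assuming the outcome_df created above, of how a stratified 10-fold partition across regions could be set up with caret::createFolds; the example classifier at the end of this page does the same thing with 5 folds.

library(caret)

set.seed(1234)
# one list entry per fold, each holding the row indices (regions) used for validation in that fold
fold_indices <- createFolds(outcome_df$Direction, k=10)
str(fold_indices)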

Possible Questions

Feature representation and preprocessing

  • Does standardizing affordability for each region affect prediction performance? Compare standardized to non-standardized affordability.
  • Does using quarter-to-quarter change (continuous or discrete) improve prediction performance? Compare quarter-to-quarter changes in affordability as predictors to affordability itself as a predictor.
  • Should we use the full time series for each region, or should we use only the last few years? Compare the full time series to a subset of the time series.
  • Should we expand the training set to multiple time series per region? For example, create a similar outcome for each time point in the dataset (change relative to affordability one year ago) and use data from the last couple of years as predictors. Train on the extended dataset and test on the 2017 data above (a sketch of this labeling appears after this list).
  • Should we do dimensionality reduction (PCA) and use the embedded data to do prediction?
  • Create your own question!
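
For the expanded-training-set question above, here is a minimal sketch, with illustrative (not prescribed) variable names, of how every time point could be labeled with an up/down outcome relative to one year earlier:

expanded_outcome_df <- tidy_afford %>%
  arrange(RegionID, time) %>%
  group_by(RegionID) %>%
  # the data is quarterly, so a lag of 4 observations is one year
  mutate(yearly_change = affordability - lag(affordability, 4)) %>%
  mutate(Direction = ifelse(yearly_change > 0, "up", "down")) %>%
  ungroup() %>%
  # drop time points with no observation one year earlier
  drop_na(Direction)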

Classification Algorithm

  • Is a decision tree better than logistic regression? (A minimal sketch of fitting both appears after this list.)
  • Is a random forest better than a decision tree?
  • Is K-nearest neighbors better than a random forest?
  • Create your own question!
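
As one illustration of the first comparison (decision tree vs. logistic regression), and assuming the final_df data frame constructed in the example classifier section below, the two models could be fit roughly like this; rpart is just one of several tree implementations you could choose:

library(rpart)

model_df <- final_df %>% select(-RegionID)

# a single classification tree
tree_fit <- rpart(Direction ~ ., data=model_df, method="class")

# logistic regression; Direction is a two-level factor, so family=binomial models P(up)
logit_fit <- glm(Direction ~ ., data=model_df, family=binomial)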

Algorithm tuning

  • Does tuning hyper-parameters using cross-validation improve performance? (See the sketch below.)
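
One way this could be set up, shown only as a sketch and again assuming the final_df data frame constructed in the example classifier section below, is with caret::train, which tunes the random forest’s mtry hyper-parameter by cross-validation; the grid values here are illustrative:

library(caret)

set.seed(1234)
tuned_rf <- train(Direction ~ .,
                  data=final_df %>% select(-RegionID),
                  method="rf",
                  tuneGrid=expand.grid(mtry=c(2, 3, 5)),
                  trControl=trainControl(method="cv", number=5))
tuned_rf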

Note that you still have to make some choices regardless of the question you choose. For example, to do the feature preprocessing and representation experiments you have to choose a classifier (random forest for example), and decide what to do about hyper-parameters if appropriate.

Submission

Prepare an Rmarkdown file which includes:

  1. Code to prepare data (copied from chunks above)
  2. Discussion of the question you have chosen to address including discussion of other choices you have made (e.g., feature representation, classification algorithm) to carry out your experiment.
  3. Code to carry out your cross-validation experiment.
  4. Table (result of the hypothesis test for the difference between algorithms) and ROC curve plot (with AUROC values).
  5. Interpretation and discussion of your experiment results.

An example classifier

Let’s build a random forest classifier using quarterly differences after data standardization for years 2014-2016. First, filter to the years of interest and standardize affordability for each region.

standardized_df <- predictor_df %>%
  filter(year(time) %in% 2014:2016) %>%
  group_by(RegionID) %>%
  mutate(mean_aff = mean(affordability)) %>%
  mutate(sd_aff = sd(affordability)) %>%
  mutate(z_aff = (affordability - mean_aff) / sd_aff) %>%
  ungroup()

To train our model we need a table with one row per region, and attributes corresponding to differences in quarterly affordability. We will do this in stages: first we turn the tidy dataset into a wide dataset using tidyr::spread, then we create a data frame containing the differences we will use as features.

wide_df <- standardized_df %>%
  select(RegionID, time, z_aff) %>%
  tidyr::spread(time, z_aff)

Now, we turn this into quarterly differences by subtracting each quarter’s standardized affordability from the following quarter’s.

# standardized affordability with the first quarter dropped (quarters 2..T)
matrix_1 <- wide_df %>%
  select(-RegionID) %>%
  as.matrix() %>%
  .[,-1]

# standardized affordability with the last quarter dropped (quarters 1..T-1)
matrix_2 <- wide_df %>%
  select(-RegionID) %>%
  as.matrix() %>%
  .[,-ncol(.)]

# element-wise difference gives quarter-over-quarter changes, one row per region
diff_df <- (matrix_1 - matrix_2) %>%
  magrittr::set_colnames(NULL) %>%
  as_data_frame() %>%
  mutate(RegionID = wide_df$RegionID)

Finally, add the outcome we want to predict from the data frame we created previously.

final_df <- diff_df %>%
  inner_join(outcome_df %>% select(RegionID, Direction), by="RegionID") %>%
  mutate(Direction=factor(Direction, levels=c("down", "up")))
final_df
## # A tibble: 80 x 13
##        V1      V2     V3      V4    V5      V6      V7     V8      V9
##     <dbl>   <dbl>  <dbl>   <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 -0.482 -0.224  -1.13  -0.113  1.22   0.0828  0.270  -0.869 -0.187 
##  2 -0.656 -0.0920 -1.01  -0.153  0.757 -0.0590 -0.138  -0.908 -0.449 
##  3 -0.423 -0.251  -1.20  -0.387  0.646 -0.0271 -0.110  -0.718 -0.498 
##  4 -0.308  0.0988 -0.938  0.179  1.26   0.0681  0.392  -0.874 -0.361 
##  5 -0.371  0.0929 -1.03   0.150  1.60  -0.107   0.307  -0.541 -0.384 
##  6 -0.496  0.239  -0.738  0.362  1.04  -0.0637  0.465  -0.607 -0.0931
##  7 -0.472 -0.118  -1.05  -0.458  0.509 -0.285   0.116  -0.844 -0.327 
##  8 -0.485  0.261  -0.661 -0.0810 0.921  0.0145  0.427  -0.213  0.231 
##  9 -0.438  0.527  -1.11   0.0971 0.720  0.191   0.186  -0.219  0.196 
## 10 -0.919 -0.480  -1.01  -0.170  0.814 -0.407   0.0552 -0.684 -0.441 
## # ... with 70 more rows, and 4 more variables: V10 <dbl>, V11 <dbl>,
## #   RegionID <int>, Direction <fct>

Now split up the data set into training regions and testing regions (using an 80/20 random split, stratified by Direction).

set.seed(1234)
test_random_forest_df <- final_df %>%
  group_by(Direction) %>%
  sample_frac(.2) %>%
  ungroup()

train_random_forest_df <- final_df %>%
  anti_join(test_random_forest_df, by="RegionID")

Let’s train the random forest using default parameters.

library(randomForest)

rf <- randomForest(Direction~., data=train_random_forest_df %>% select(-RegionID))
rf
## 
## Call:
##  randomForest(formula = Direction ~ ., data = train_random_forest_df %>%      select(-RegionID)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 42.19%
## Confusion matrix:
##      down up class.error
## down    4 19    0.826087
## up      8 33    0.195122

Now, let’s make predictions on the held-out test set.

test_predictions <- predict(rf, newdata=test_random_forest_df %>% select(-RegionID))

Make a confusion matrix and calculate the error rate.

table(pred=test_predictions, observed=test_random_forest_df$Direction)
##       observed
## pred   down up
##   down    2  1
##   up      4  9

The error rate is (1 + 4) / 16 = 31.25%.
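
The same number can be computed directly from the predictions (this should reproduce the 31.25% above):

# proportion of test regions where the predicted direction disagrees with the observed direction
mean(test_predictions != test_random_forest_df$Direction)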

Cross-validation

Now, the same example, but using 5-fold cross-validation to compare a random forest with 500 trees to a random forest with 10 trees. Since this is a smallish dataset, I am using 5-fold cross-validation so that the validation sets have more examples and therefore give more reliable performance estimates.

First, we create a data frame to contain the results of the experiment. That is, for each observation (region) in the dataset (rows) we will have attributes for the predictions (as continuous values) from both random forests, the observed label (‘up’ or ‘down’), and the index of the cross-validation fold in which the observation was used as a validation example. We will then use this data frame to compute error rates, do a hypothesis test for the difference in error rate, and create ROC curves.

We use the caret::createFolds function to create a stratified 5-fold cross-validation partition. The result is a list of length 5, containing the indices of the examples used for validation in each of the cross-validation folds.

We pipe the result to the purrr::imap function, which applies to each list element (and its index) a function that (a) splits the data into train and test sets, (b) fits the two random forests, and (c) creates a data frame with the values we need.

The result of purrr::imap is a list of data frames, so we pipe it to purrr::reduce to create a single data frame.

set.seed(1234)

library(caret)
# create the cross-validation partition
result_df <- createFolds(final_df$Direction, k=5) %>%
  # fit models and gather results
  purrr::imap(function(test_indices, fold_number) {
    # split into train and test for the fold
    train_df <- final_df %>%
      select(-RegionID) %>%
      slice(-test_indices)

    test_df <- final_df %>%
      select(-RegionID) %>%
      slice(test_indices)
  
    # fit the two models
    rf1 <- randomForest(Direction~., data=train_df, ntree=500)
    rf2 <- randomForest(Direction~., data=train_df, ntree=10)
    
    # gather results
    test_df %>%
      select(observed_label = Direction) %>%
      mutate(fold=fold_number) %>%
      mutate(prob_positive_rf1 = predict(rf1, newdata=test_df, type="prob")[,"up"]) %>%
      # add predicted labels for rf1 using a 0.5 probability cutoff
      mutate(predicted_label_rf1 = ifelse(prob_positive_rf1 > 0.5, "up", "down")) %>%
      mutate(prob_positive_rf2 = predict(rf2, newdata=test_df, type="prob")[, "up"]) %>%
      # add predicted labels for rf2 using a 0.5 probability cutoff
      mutate(predicted_label_rf2 = ifelse(prob_positive_rf2 > 0.5, "up", "down"))
}) %>%
  # combine the five result data frames into one
  purrr::reduce(bind_rows)
result_df
## # A tibble: 80 x 6
##    observed_label fold  prob_positive_r… predicted_label… prob_positive_r…
##    <fct>          <chr>            <dbl> <chr>                       <dbl>
##  1 down           Fold1            0.542 up                            0.4
##  2 down           Fold1            0.652 up                            0.8
##  3 down           Fold1            0.62  up                            0.6
##  4 up             Fold1            0.708 up                            0.8
##  5 down           Fold1            0.756 up                            0.8
##  6 up             Fold1            0.636 up                            0.8
##  7 up             Fold1            0.522 up                            0.4
##  8 up             Fold1            0.772 up                            1  
##  9 up             Fold1            0.66  up                            0.7
## 10 up             Fold1            0.474 down                          0.4
## # ... with 70 more rows, and 1 more variable: predicted_label_rf2 <chr>

Now let’s compute error rates for each model on each fold, and do a test for differences in error rate (the linear model of error rate on model below amounts to a two-sample t-test across the five folds):

result_df %>%
  mutate(error_rf1 = observed_label != predicted_label_rf1,
         error_rf2 = observed_label != predicted_label_rf2) %>%
  group_by(fold) %>%
  summarize(big_rf = mean(error_rf1), small_rf = mean(error_rf2)) %>%
  tidyr::gather(model, error, -fold) %>%
  lm(error~model, data=.) %>%
  broom::tidy()
##            term   estimate std.error statistic      p.value
## 1   (Intercept) 0.41225490 0.0596994 6.9055114 0.0001238381
## 2 modelsmall_rf 0.03392157 0.0844277 0.4017824 0.6983639992
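
Since the same five folds are used for both models, you might also consider a paired test on the per-fold error rates; a minimal sketch, reusing the error computation above:

# per-fold error rates for each model
error_df <- result_df %>%
  mutate(error_rf1 = observed_label != predicted_label_rf1,
         error_rf2 = observed_label != predicted_label_rf2) %>%
  group_by(fold) %>%
  summarize(big_rf = mean(error_rf1), small_rf = mean(error_rf2))

# paired t-test: pairs the two models within each fold
t.test(error_df$big_rf, error_df$small_rf, paired=TRUE)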

Now, let’s make an ROC curve. When making ROC curves based on a cross-validation experiment, we need to aggregate performance across the five folds in some way. The ROCR package provides facilities to do this in its plot function; see the comments in the code below to see how aggregation is done. For this to work properly, the ROCR::prediction function takes lists of predictions and labels (one list element per fold) instead of vectors to create ROC curves and compute AUROC. We use the split function to create these lists.

library(ROCR)

# create a list of true observed labels 
labels <- split(result_df$observed_label, result_df$fold)

# now create a list of predictions for the first RF and pass it to the ROCR::prediction function
predictions_rf1 <- split(result_df$prob_positive_rf1, result_df$fold) %>% prediction(labels)

# do the same for the second RF
predictions_rf2 <- split(result_df$prob_positive_rf2, result_df$fold) %>% prediction(labels)

# compute average AUC for the first RF
mean_auc_rf1 <- predictions_rf1 %>%
  performance(measure="auc") %>%
  # I know, this line is ugly, but that's how it is
  slot("y.values") %>% unlist() %>% 
  mean()

# compute average AUC for the second RF
mean_auc_rf2 <- predictions_rf2 %>%
  performance(measure="auc") %>%
  slot("y.values") %>% unlist() %>% 
  mean()

# plot the ROC curve for the first RF
predictions_rf1 %>%
  performance(measure="tpr", x.measure="fpr") %>%
  plot(avg="threshold", col="orange", lwd=2)

# plot the ROC curve for the second RF
predictions_rf2 %>%
  performance(measure="tpr", x.measure="fpr") %>%
  plot(avg="threshold", col="blue", lwd=2, add=TRUE)

# add a legend to the plot
legend("bottomright",
       legend=paste(c("big", "small"), "rf, AUC:", round(c(mean_auc_rf1, mean_auc_rf2), digits=3)),
       col=c("orange", "blue"))