Last Update: 2020-04-28

Data

We will use Mortgage Affordability data from Zillow to experiment with classification algorithms. The data was downloaded from the Zillow Research page: https://www.zillow.com/research/data/

It is made available here: http://www.hcbravo.org/IntroDataSci/misc/Affordability_Wide_2017Q4_Public.csv

Download the csv file to your project directory.

Preparing data

First, we will tidy the data. Please include this piece of code in your submission.

library(tidyverse)
library(lubridate)
theme_set(theme_bw())
csv_file <- "Affordability_Wide_2017Q4_Public.csv"
tidy_afford <- read_csv(csv_file) %>%
  filter(Index == "Mortgage Affordability") %>%
  drop_na() %>%
  filter(RegionID != 0, RegionName != "United States") %>%
  dplyr::select(RegionID, RegionName, matches("^[12]")) %>%
  gather(time, affordability, matches("^[12]")) %>%
  type_convert(col_types=cols(time=col_date(format="%Y-%m")))
tidy_afford
## # A tibble: 12,480 x 4
##    RegionID RegionName                         time       affordability
##       <dbl> <chr>                              <date>             <dbl>
##  1   394913 New York, NY                       1979-03-01         0.262
##  2   753899 Los Angeles-Long Beach-Anaheim, CA 1979-03-01         0.358
##  3   394463 Chicago, IL                        1979-03-01         0.262
##  4   394514 Dallas-Fort Worth, TX              1979-03-01         0.301
##  5   394974 Philadelphia, PA                   1979-03-01         0.204
##  6   394692 Houston, TX                        1979-03-01         0.243
##  7   395209 Washington, DC                     1979-03-01         0.254
##  8   394856 Miami-Fort Lauderdale, FL          1979-03-01         0.268
##  9   394347 Atlanta, GA                        1979-03-01         0.248
## 10   394404 Boston, MA                         1979-03-01         0.222
## # … with 12,470 more rows

This is what the data looks like:

tidy_afford %>%
  ggplot(aes(x=time,y=affordability,group=factor(RegionID))) +
  geom_line(color="GRAY", alpha=3/4, size=1/2) +
  labs(title="County-Level Mortgage Affordability over Time",
          x="Date", y="Mortgage Affordability")

The prediction task

The prediction question we are going to address is:

Can we predict whether mortgage affordability will increase or decrease a year from now?

Specifically, we will do this for quarter 4 (Q4) of 2017. To create the outcome we will predict, we compare affordability in Q4 of 2017 to affordability in Q4 of 2016 and label each region as up or down depending on the sign of this difference. Let’s create the outcome we want to predict (again, copy this bit of code to your submission):

outcome_df <- tidy_afford %>%
  mutate(yq = quarter(time, with_year=TRUE)) %>%
  filter(yq %in% c("2016.4", "2017.4")) %>%
  select(RegionID, RegionName, yq, affordability) %>%
  spread(yq, affordability) %>%
  mutate(diff = `2017.4` - `2016.4`) %>%
  mutate(Direction = ifelse(diff>0, "up", "down")) %>%
  select(RegionID, RegionName, Direction)
outcome_df
## # A tibble: 80 x 3
##    RegionID RegionName      Direction
##       <dbl> <chr>           <chr>    
##  1   394304 Akron, OH       down     
##  2   394312 Albuquerque, NM down     
##  3   394318 Allentown, PA   down     
##  4   394347 Atlanta, GA     up       
##  5   394355 Austin, TX      up       
##  6   394357 Bakersfield, CA down     
##  7   394358 Baltimore, MD   down     
##  8   394367 Baton Rouge, LA up       
##  9   394378 Bellingham, WA  up       
## 10   394388 Birmingham, AL  down     
## # … with 70 more rows

Now, you have a dataframe with outcomes (labels) for each county in the dataset.

The goal is then, given predictors \(X_i\) for county \(i\), to build a classifier for the outcome \(G_i \in \{\mathtt{up},\mathtt{down}\}\).

To train your classifiers you should use only data up to and including 2016:

predictor_df <- tidy_afford %>%
  filter(year(time) <= 2016)

Your project

Your goal for this project is to carry out an experiment that addresses a single technical question about our ability to make this prediction. A list of possible questions is given below. Each asks you to compare two specific choices in the classification workflow (e.g., two classification algorithms, or two feature representations). You will implement both choices and use 10-fold cross-validation (with folds defined across RegionIDs) to compare their relative performance. You will also create ROC curves and compare the resulting AUROCs.
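
For instance, here is a minimal sketch of setting up region-level folds with caret, using the outcome_df created above (the fold indices are then used to split whatever feature table you build):

library(caret)
set.seed(1234)
# outcome_df has one row per region, so each of the 10 folds
# holds out a disjoint set of regions
cv_partition <- createFolds(outcome_df$Direction, k=10)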

Possible Questions

Feature representation and preprocessing

  • Does standardizing affordability for each region affect prediction performance? Compare standardized to non-standardized affordability.
  • Does using quarter-to-quarter change (continuous or discrete) improve prediction performance? Compare quarter-to-quarter changes in affordability as predictors to the affordability values themselves.
  • Should we use the full time series for each region, or should we use only the last few years? Compare the full time series to a subset of it.
  • Should we expand the training set to multiple time series per region? For example, create a similar outcome for each time point in the dataset (change relative to affordability one year ago) and use data from the last couple of years as predictors. Train on the extended dataset and test on the 2017 data above.
  • Should we do dimensionality reduction (PCA) and use the embedded data for prediction? (See the sketch after this list.)
  • Create your own question!
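
For the PCA question, here is a minimal sketch of one way to build embedded predictors. It assumes a wide region-by-quarter table like the wide_df constructed in the example below, and keeping 5 components is an arbitrary choice:

pca_fit <- wide_df %>%
  select(-RegionID) %>%
  prcomp(center=TRUE, scale.=TRUE)

# keep the first few principal component scores as predictors
pca_df <- as_tibble(pca_fit$x[, 1:5]) %>%
  mutate(RegionID = wide_df$RegionID)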

Classification Algorithm

  • Is a decision tree better than logistic regression?
  • Is a random forest better than a decision tree?
  • Is K-nearest neighbors better than a random forest?
  • Create your own question!

Note that you still have to make some choices regardless of the question you choose. For example, to run the feature preprocessing and representation experiments you have to choose a classifier (a random forest, for example) and decide what to do about hyper-parameters, if appropriate. One way to handle the hyper-parameter issue is sketched below.
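
For example, if you settle on a random forest, you can take hyper-parameter tuning out of the comparison by fixing the hyper-parameter explicitly in caret. This is only a sketch: mtry=3 is an arbitrary value, and final_df stands for whatever feature table your experiment produces (the example below builds one):

library(caret)

# hold the random forest hyper-parameter fixed so both
# experimental settings are trained the same way
fixed_grid <- data.frame(mtry=3)

fit <- train(Direction~., data=final_df,
             method="rf",
             tuneGrid=fixed_grid,
             trControl=trainControl(method="none"))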

Submission

Prepare an R Markdown file that includes:

  1. Code to prepare data (copied from chunks above), plus any additional data prep for your experiment.
  2. Discussion of the question you have chosen to address including discussion of other choices you have made (e.g., feature representation, classification algorithm) to carry out your experiment.
  3. Code to carry out your cross-validation experiment.
  4. A table (the result of a hypothesis test for the difference between the two settings) and a plot comparing AUROCs.
  5. ROC curves for both experimental settings.
  6. Interpretation and discussion of your experiment results.

Knit to PDF and submit on ELMS.

An example experiment

Question: Does the number of trees used in a random forest classifier affect performance (AUROC measured with 5-fold CV)?

Other choices: we transform the input data to quarterly differences after standardizing affordability within each region, using years 2014-2016.

Data preparation

First, filter to the years of interest and standardize affordability for each region:

standardized_df <- predictor_df %>%
  filter(year(time) %in% 2014:2016) %>%
  group_by(RegionID) %>%
  mutate(mean_aff = mean(affordability)) %>%
  mutate(sd_aff = sd(affordability)) %>%
  mutate(z_aff = (affordability - mean_aff) / sd_aff) %>%
  ungroup()
standardized_df
## # A tibble: 960 x 7
##    RegionID RegionName         time       affordability mean_aff  sd_aff   z_aff
##       <dbl> <chr>              <date>             <dbl>    <dbl>   <dbl>   <dbl>
##  1   394913 New York, NY       2014-03-01         0.258    0.246 0.00854  1.41  
##  2   753899 Los Angeles-Long … 2014-03-01         0.399    0.390 0.0111   0.799 
##  3   394463 Chicago, IL        2014-03-01         0.142    0.136 0.00422  1.26  
##  4   394514 Dallas-Fort Worth… 2014-03-01         0.123    0.127 0.00779 -0.503 
##  5   394974 Philadelphia, PA   2014-03-01         0.152    0.144 0.00518  1.68  
##  6   394692 Houston, TX        2014-03-01         0.117    0.119 0.00545 -0.384 
##  7   395209 Washington, DC     2014-03-01         0.185    0.177 0.00588  1.47  
##  8   394856 Miami-Fort Lauder… 2014-03-01         0.178    0.194 0.0131  -1.17  
##  9   394347 Atlanta, GA        2014-03-01         0.118    0.118 0.00384 -0.0115
## 10   394404 Boston, MA         2014-03-01         0.223    0.217 0.00662  0.875 
## # … with 950 more rows

To train our model we need a table with one row per region and attributes corresponding to quarter-to-quarter differences in affordability. We will do this in stages: first we turn the tidy dataset into a wide dataset using tidyr::spread, then we create a data frame containing the differences we use as features.

wide_df <- standardized_df %>%
  select(RegionID, time, z_aff) %>%
  tidyr::spread(time, z_aff)
wide_df
## # A tibble: 80 x 13
##    RegionID `2014-03-01` `2014-06-01` `2014-09-01` `2014-12-01` `2015-03-01`
##       <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
##  1   394304      0.755          0.272       0.0480      -1.09         -1.20 
##  2   394312      1.38           0.727       0.635       -0.378        -0.531
##  3   394318      1.70           1.28        1.02        -0.175        -0.562
##  4   394347     -0.0115        -0.319      -0.220       -1.16         -0.980
##  5   394355     -0.00220       -0.374      -0.281       -1.31         -1.16 
##  6   394357     -0.195         -0.691      -0.453       -1.19         -0.829
##  7   394358      1.68           1.21        1.09         0.0387       -0.419
##  8   394367     -0.153         -0.638      -0.377       -1.04         -1.12 
##  9   394378     -0.0408        -0.479       0.0482      -1.06         -0.965
## 10   394388      2.01           1.09        0.614       -0.399        -0.569
## # … with 70 more rows, and 7 more variables: `2015-06-01` <dbl>,
## #   `2015-09-01` <dbl>, `2015-12-01` <dbl>, `2016-03-01` <dbl>,
## #   `2016-06-01` <dbl>, `2016-09-01` <dbl>, `2016-12-01` <dbl>

Now, we turn this into quarterly differences:

# matrix of standardized affordability with the first quarter dropped
matrix_1 <- wide_df %>%
  select(-RegionID) %>%
  as.matrix() %>%
  .[,-1]

# the same matrix with the last quarter dropped
matrix_2 <- wide_df %>%
  select(-RegionID) %>%
  as.matrix() %>%
  .[,-ncol(.)]

# subtracting the two gives quarter-to-quarter differences;
# name the difference columns V1, V2, ... and add RegionID back
diff_df <- (matrix_1 - matrix_2) %>%
  magrittr::set_colnames(paste0("V", seq_len(ncol(.)))) %>%
  as_tibble() %>%
  mutate(RegionID = wide_df$RegionID)
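
As an aside, the same quarter-to-quarter differences can be computed directly in long format with dplyr::lag before spreading; this is just an equivalent sketch, not a required step:

# equivalent sketch: per-region differences computed in long format
diff_long_df <- standardized_df %>%
  group_by(RegionID) %>%
  arrange(time, .by_group=TRUE) %>%
  mutate(z_diff = z_aff - lag(z_aff)) %>%
  ungroup() %>%
  drop_na(z_diff) %>%
  select(RegionID, time, z_diff)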

Finally, add the outcome we want to predict, taken from the outcome data frame we created previously:

final_df <- diff_df %>%
  inner_join(outcome_df %>% select(RegionID, Direction), by="RegionID") %>%
  mutate(Direction=factor(Direction, levels=c("down", "up"))) %>%
  select(-RegionID)
final_df
## # A tibble: 80 x 12
##        V1      V2     V3      V4    V5      V6      V7     V8      V9     V10
##     <dbl>   <dbl>  <dbl>   <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
##  1 -0.482 -0.224  -1.13  -0.113  1.22   0.0828  0.270  -0.869 -0.187   0.0372
##  2 -0.656 -0.0920 -1.01  -0.153  0.757 -0.0590 -0.138  -0.908 -0.449  -0.297 
##  3 -0.423 -0.251  -1.20  -0.387  0.646 -0.0271 -0.110  -0.718 -0.498  -0.388 
##  4 -0.308  0.0988 -0.938  0.179  1.26   0.0681  0.392  -0.874 -0.361  -0.201 
##  5 -0.371  0.0929 -1.03   0.150  1.60  -0.107   0.307  -0.541 -0.384  -0.347 
##  6 -0.496  0.239  -0.738  0.362  1.04  -0.0637  0.465  -0.607 -0.0931 -0.205 
##  7 -0.472 -0.118  -1.05  -0.458  0.509 -0.285   0.116  -0.844 -0.327  -0.402 
##  8 -0.485  0.261  -0.661 -0.0810 0.921  0.0145  0.427  -0.213  0.231   0.123 
##  9 -0.438  0.527  -1.11   0.0971 0.720  0.191   0.186  -0.219  0.196  -0.386 
## 10 -0.919 -0.480  -1.01  -0.170  0.814 -0.407   0.0552 -0.684 -0.441  -0.217 
## # … with 70 more rows, and 2 more variables: V11 <dbl>, Direction <fct>

Run the experiment

We will use 5-fold cross-validation to compare a random forest with 500 trees to a random forest with 10 trees. Since this is a smallish dataset, I am using 5-fold rather than 10-fold cross-validation so that the validation sets have more examples and therefore give more reliable performance estimates.

library(caret)
set.seed(1234)
# create the cross-validation partition
cv_partition <- createFolds(final_df$Direction,
                            k=5)

# setup training parameters
fit_control <- trainControl( ## 5-fold CV
  method = "cv",
  number = 5,
  #indexOut = cv_partition,
  summaryFunction=twoClassSummary,
  classProbs=TRUE,
  savePredictions=TRUE)


# a function to obtain performance data
# (tpr and fpr) over the given cross validation
# partitions, for the number of trees in the
# random forest
get_roc_data <- function(ntree, cv_partition) {
  mean_fpr <- seq(0, 1, len=100)
  
  # iterate over folds
  res <- lapply(seq_along(cv_partition),  function(i) {
    # train the random forest 
    fit <- train(Direction~.,
                        data = final_df[-cv_partition[[i]],], # all but the holdout set
                        method = "rf",
                        ntree = ntree,
                        trControl = fit_control,
                        metric="ROC")
    
    # make predictions on the holdout set
    preds <- predict(fit, newdata=final_df[cv_partition[[i]],],type="prob")$up
    
    # compute tpr and fpr from the hold out set
    perf <- ROCR::prediction(preds, final_df$Direction[cv_partition[[i]]]) %>%
      ROCR::performance(measure="tpr", x.measure="fpr")

    fpr <- unlist(perf@x.values)
    tpr <- unlist(perf@y.values)
    
    # interpolate the roc curve over 0, 1 range
    interp_tpr <- approxfun(fpr, tpr)(mean_fpr)
    interp_tpr[1] <- 0.0
    
    # collect values for this fold
    tibble(fold=i, fpr=mean_fpr, tpr=interp_tpr)
  })
  
  # combine values across all folds
  # into a single data frame
  do.call(rbind, res)
}

# calculate area under the ROC curve
# from tpr and fpr values across folds
compute_auc <- function(curve_df) {
  curve_df %>% 
    group_by(fold) %>%
    summarize(auc=pracma::trapz(fpr, tpr))
}
# get performance data for random forest with
# 10 trees
small_curve_df <- get_roc_data(ntree = 10, cv_partition)
small_auc_df <- compute_auc(small_curve_df)
# get performance data for random forest with
# 500 trees
large_curve_df <- get_roc_data(ntree=500, cv_partition)
large_auc_df <- compute_auc(large_curve_df)

Now let’s compare the models based on the 5-fold CV estimates of area under the ROC curve (AUROC).

# combine performance data for both models into one
# data frame, adding a column to indicate which model was used
curve_df <- small_curve_df %>%
  mutate(model="small") %>%
  rbind(mutate(large_curve_df, model="large")) %>%
  mutate(model = factor(model, levels=c("small", "large")))

auc_df <- small_auc_df %>%
  mutate(model="small") %>%
  rbind(mutate(large_auc_df, model="large")) %>%
  mutate(model = factor(model, levels=c("small", "large")))

# plot the distribution of per-fold AUC values for each model
ggplot(auc_df, aes(x=model, y=auc)) +
  geom_jitter(position=position_jitter(0.1)) +
  coord_flip() + 
  labs(title="AUC comparison",
       x="Model",
       y="Area under ROC curve")

We test for a difference in mean AUROC between the two models using linear regression on the per-fold AUC values.

library(broom)

model_tab <- auc_df %>%
  lm(auc~model,data=.) %>%
  tidy() 

model_tab %>%
  knitr::kable()
term          estimate    std.error   statistic    p.value
(Intercept)   0.6447212   0.0437130   14.748964    0.0000004
modellarge    0.0145139   0.0618195    0.234779    0.8202779

We see a small increase (about 0.015, i.e., 1.5 percentage points) in average AUROC for the larger model, but the difference is not statistically significant.
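
Since both models were evaluated on the same folds, a paired comparison of the per-fold AUC values is a reasonable complementary check (a sketch, not part of the required submission):

# paired t-test on per-fold AUCs (the folds are shared between the two models)
t.test(large_auc_df$auc, small_auc_df$auc, paired=TRUE)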

Finally, here are the fold-averaged ROC curves for both models.

curve_df %>%
  group_by(model, fpr) %>%
  summarize(tpr = mean(tpr)) %>%
  ggplot(aes(x=fpr, y=tpr, color=model)) +
    geom_line() +
    labs(title = "ROC curves",
         x = "False positive rate",
         y = "True positive rate")

The bigger model has a higher true positive rate at low false positive rates, while the smaller model does better at high false positive rates. Neither model is very good at this prediction task, and there is no clear advantage to using either of them for this analysis.