Last Update: 2020-04-28

Data

We will use Mortgage Affordability data from Zillow to experiment with classification algorithms. The data was downloaded from the Zillow Research page: https://www.zillow.com/research/data/

It is made available here: http://www.hcbravo.org/IntroDataSci/misc/Affordability_Wide_2017Q4_Public.csv

Download the csv file to your project directory.
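
If you prefer to fetch the file from R rather than the browser, the following is a minimal sketch. The helper name `fetch_if_missing` is hypothetical (it is not part of the assignment or of any package); it only wraps base R's `download.file` and skips the download when the file is already present.

```r
# Hypothetical helper: download `url` to `dest` unless `dest` already exists.
# Returns the destination path invisibly.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" writes the bytes unchanged, which avoids
    # line-ending translation on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# URL and file name as given in this document
data_url <- "http://www.hcbravo.org/IntroDataSci/misc/Affordability_Wide_2017Q4_Public.csv"
csv_file <- "Affordability_Wide_2017Q4_Public.csv"
```

Calling `fetch_if_missing(data_url, csv_file)` once from your project directory should leave the csv file where the code in the next section expects it.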

Preparing data

First, we will tidy the data. Please include this piece of code in your submission.

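library(tidyverse)
library(lubridate)
theme_set(theme_bw())
csv_file <- "Affordability_Wide_2017Q4_Public.csv"
tidy_afford <- read_csv(csv_file) %>%
  # keep only the mortgage affordability index
  filter(Index == "Mortgage Affordability") %>%
  # drop regions with a missing value in any column
  drop_na() %>%
  # remove the national aggregate row
  filter(RegionID != 0, RegionName != "United States") %>%
  # keep identifiers plus the date columns (names starting with 1 or 2)
  dplyr::select(RegionID, RegionName, matches("^[12]")) %>%
  # reshape from wide (one column per date) to long (one row per region and date)
  gather(time, affordability, matches("^[12]")) %>%
  # parse the "YYYY-MM" column names as dates
  type_convert(col_types=cols(time=col_date(format="%Y-%m")))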
tidy_afford
## # A tibble: 12,480 x 4
##    RegionID RegionName                         time       affordability
##       <dbl> <chr>                              <date>             <dbl>
##  1   394913 New York, NY                       1979-03-01         0.262
##  2   753899 Los Angeles-Long Beach-Anaheim, CA 1979-03-01         0.358
##  3   394463 Chicago, IL                        1979-03-01         0.262
##  4   394514 Dallas-Fort Worth, TX              1979-03-01         0.301
##  5   394974 Philadelphia, PA                   1979-03-01         0.204
##  6   394692 Houston, TX                        1979-03-01         0.243
##  7   395209 Washington, DC                     1979-03-01         0.254
##  8   394856 Miami-Fort Lauderdale, FL          1979-03-01         0.268
##  9   394347 Atlanta, GA                        1979-03-01         0.248
## 10   394404 Boston, MA                         1979-03-01         0.222
## # … with 12,470 more rows
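
To see what the reshaping step does, it can help to run it on a tiny table. The sketch below uses made-up regions and values (they are not from the Zillow file) and uses `pivot_longer`, the newer tidyr counterpart to `gather`, which assumes tidyr 1.0.0 or later. It is an illustration only; keep the code above in your submission.

```r
library(tidyverse)

# Hypothetical wide table: one row per region, one column per time period
toy_wide <- tibble(
  RegionID   = c(1, 2),
  RegionName = c("Region A", "Region B"),
  `1979-03`  = c(0.26, 0.36),
  `1979-06`  = c(0.27, 0.35)
)

toy_long <- toy_wide %>%
  # stack every column whose name starts with 1 or 2 into two columns:
  # `time` holds the old column name, `affordability` holds the value
  pivot_longer(matches("^[12]"),
               names_to = "time", values_to = "affordability") %>%
  # parse the "YYYY-MM" strings as dates
  mutate(time = parse_date(time, format = "%Y-%m"))

toy_long
```

Each region now contributes one row per time period, which is the same layout as `tidy_afford`. Note that `pivot_longer` orders rows by region first, while `gather` orders them by time period first.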

This is what the data looks like:

tidy_afford %>%
  ggplot(aes(x=time,y=affordability,group=factor(RegionID))) +
  geom_line(color="GRAY", alpha=3/4, size=1/2) +
  labs(title="Metro-Area Mortgage Affordability over Time",
       x="Date", y="Mortgage Affordability")