Here we duplicate the end-to-end project from ML book. This analysis looks at housing data in California. It will build predictors for housing value (a continuous outcome) based on other attributes.
This function downloads dataset
import os
import tarfile
from six.moves import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
if not os.path.isdir(housing_path):
os.makedirs(housing_path)
tgz_path = os.path.join(housing_path, "housing.tgz")
urllib.request.urlretrieve(housing_url, tgz_path)
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=housing_path)
housing_tgz.close()
This function loads the dataset as apandas DataFrame object
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)
fetch_housing_data()
housing = load_housing_data()
housing.head()
housing.info()
Let's print a table of value counts for the "ocean_proximity" attribute
housing["ocean_proximity"].value_counts()
Use the describe
method to see a summary of all variables
housing.describe()
Let's make histograms of all variables
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
Next, let's create a test set which we will ignore during system development. We think that median income will be important for predicting house prices so we will make sure that both test and train set retain the same distribution of median income (as much as possible). To do so, instead of using random sampling to create the test set, we will use stratified sampling using a categorical version of the median income attribute to stratify the dataset.
First, we create a categorical median income attribute
import numpy as np
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
Let's take a look at the distribution of the categorical income attribute
housing["income_cat"].value_counts() / len(housing)
Next, we use a function from the scikit.learn package for stratified sampling and verify that the train and test sets both have similar distribution of categorical income as the overall dataset.
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1234)
for train_index, test_index in split.split(housing, housing["income_cat"]):
train_set = housing.loc[train_index]
test_set = housing.loc[test_index]
train_set["income_cat"].value_counts() / len(train_set)
test_set["income_cat"].value_counts() / len(test_set)
Looks good enough. Finally, remove the income categorical variable from the train and test sets since we don't need them anymore
for set in (train_set, test_set):
set.drop(["income_cat"], axis=1, inplace=True)
First, let's rename the training dataset to make it easier to refer to.
housing = train_set.copy()
Let's make a scatter plot of latitude and longitude using population for size, and house value for color
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population",
c="median_house_value", cmap=plt.get_cmap("YlGnBu"), colorbar=True)
plt.legend()
Let's look at attribute correlation
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Median income has the highest correlation, total rooms and latitude also have high correlation to the outcome we are trying to predict. We can also make a scatterplot matrix.
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()
Correlation between median income (attribute) and median house value (outcome) is clear. Let's take a closer look.
housing.plot.scatter(x="median_income", y="median_house_value", alpha=0.1)
plt.show()
There are horizontal bands at some specific house values. This could be an issue from how the data was obtained. It may be difficult for our models to catch this properly.
Let's normalize a few of the attributes and see if that improves some of the correlations we observed.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Normalizing total rooms improved correlation slightly, normalizing total bedrooms made a big difference, normalizing population did not.
Let's start over again with the training set and remove the outcome column so it is not transformed by our preprocessing pipeline
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()
Let's use sklearn pipelines to clean up and transform our data based on what we've observed after exploratory analysis. We will do the following:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler, LabelBinarizer
from sklearn_pandas import DataFrameMapper
cat_attributes = ['ocean_proximity']
num_attributes = [x for x in list(housing) if x not in cat_attributes]
# This transformer adds the normalized attributes we looked at above
class NormalizedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self):
self.rooms_ix = 3
self.bedrooms_ix = 4
self.population_ix = 5
self.household_ix = 6
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
rooms_per_household = X[:, self.rooms_ix] / X[:, self.household_ix]
population_per_household = X[:, self.population_ix] / X[:, self.household_ix]
bedrooms_per_room = X[:, self.bedrooms_ix] / X[:, self.rooms_ix]
return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
num_pipeline = Pipeline([
('imputer', Imputer(strategy="median")),
('attribs_adder', NormalizedAttributesAdder()),
('std_scaler', StandardScaler()),
])
full_pipeline = DataFrameMapper([
(cat_attributes, LabelBinarizer()),
(num_attributes, num_pipeline)
])
Now that we have our pipeline in place, let's transform our training set
housing_clean = full_pipeline.fit_transform(housing)
housing_clean
housing_clean.shape
Let's train a linear regression model and a regression tree model
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
lin_reg = LinearRegression()
lin_reg.fit(housing_clean, housing_labels)
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_clean, housing_labels);
Now let's calculate the root mean squared error of each on the training set
from sklearn.metrics import mean_squared_error
linreg_predictions = lin_reg.predict(housing_clean)
linreg_mse = mean_squared_error(housing_labels, linreg_predictions)
linreg_rmse = np.sqrt(linreg_mse)
tree_predictions = tree_reg.predict(housing_clean)
tree_mse = mean_squared_error(housing_labels, tree_predictions)
tree_rmse = np.sqrt(tree_mse)
print("linreg: ", linreg_rmse)
print("tree: ", tree_rmse)
Looks like the regression tree is probably overfitting the training set. Let's use cross-validation to estimate expected prediction error instead.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_clean, housing_labels,
scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
print("Scores:", scores)
print("Mean:", scores.mean())
print("Sd:", scores.std())
display_scores(tree_rmse_scores)
Clearly, the Regression tree is overfitting the training set. Let's see how linear regression fares:
scores = cross_val_score(lin_reg, housing_clean, housing_labels,
scoring="neg_mean_squared_error", cv=10)
linreg_rmse_scores = np.sqrt(-scores)
display_scores(linreg_rmse_scores)
In this case, linear regression does better than the regression tree based on the CV estimate. Let's try one more model, the random forest, which we'll see in a later lecture.
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
scores = cross_val_score(forest_reg, housing_clean, housing_labels,
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)
display_scores(forest_rmse_scores)
This model looks promising. Let's now use grid search over tuning parameters for the random forest regressor using cross validation to perform model selection.
from sklearn.model_selection import GridSearchCV
param_grid = [
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring="neg_mean_squared_error")
grid_search.fit(housing_clean, housing_labels)
We can get the best predictor after grid search, which was re-trained on the complete training set. Using that we can look at feature importance for that predictor.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(housing["ocean_proximity"])
feature_importances = grid_search.best_estimator_.feature_importances_
extra_attributes = ['rooms_per_household', 'pop_per_household', 'bedrooms_per_room']
cat_encoded_attributes = list(encoder.classes_)
attributes = num_attributes + extra_attributes + cat_encoded_attributes
sorted(zip(feature_importances, attributes), reverse=True)
Now that we have trained our random forest regressor using cross validation, we are ready to see how this does on the test set. In this case, we are not pursuing any further fine tuning of our model, e.g., removing some of the uninformative attributes, transforming attributes further (e.g., ignoring categories for proximity other than 'INLAND'), etc. If we were doing that, we would not be measuring performance on the test set. Remember, do not touch the test set until you are done developing the ML system.
Note that we will use the same preprocessing pipeline as before.
final_model = grid_search.best_estimator_
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()
X_test_clean = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_clean)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print("Final RMSE:", final_rmse)