from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Show Code"></form>''')

1. Data Wrangling

Problem Statement

Buying a house can be a very challenging process. It takes time, patience, and a lot of research to find a house that you like and then negotiate the right price for it. Several features influence house price, such as the building type, total number of rooms, garage size, masonry work, location, utilities, and many more. Can house price be estimated (or predicted) using these features? Can we identify the features that influence house price the most? How important is building type in determining the price? What influences the price more: location or size (sq.ft.)? How much value does remodeling add to a house? These are a few of the many questions I intend to explore and answer.

Dataset

The dataset is available on Kaggle (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). It is the Ames Housing dataset, compiled by Dean De Cock. It has 1460 instances and 79 explanatory variables that describe almost every aspect of residential homes in Ames, Iowa.

In this part, we will explore the data. In the process, we will clean and augment the data with additional information from other sources. We will perform the following steps:

  • Identify and remove outliers

  • Fill missing values

  • Augment the dataset by collecting and adding data from other sources like Google’s Geocoding API and uszipcode library

# Import useful libraries
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt

import matplotlib as mpl
import seaborn as sns
%matplotlib inline

sns.set(style ='white',font_scale=1.25)

# sklearn pipeline tools
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Set warnings to 'ignore' to prevent them from printing on screen
import warnings
warnings.filterwarnings('ignore')
# Load the dataset, and look at first few rows
housing_raw = pd.read_csv('data/train.csv',index_col='Id')
housing = housing_raw.copy()
housing.head()
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 80 columns

# Datatype of each column
housing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuilt      1460 non-null   int64  
 19  YearRemodAdd   1460 non-null   int64  
 20  RoofStyle      1460 non-null   object 
 21  RoofMatl       1460 non-null   object 
 22  Exterior1st    1460 non-null   object 
 23  Exterior2nd    1460 non-null   object 
 24  MasVnrType     1452 non-null   object 
 25  MasVnrArea     1452 non-null   float64
 26  ExterQual      1460 non-null   object 
 27  ExterCond      1460 non-null   object 
 28  Foundation     1460 non-null   object 
 29  BsmtQual       1423 non-null   object 
 30  BsmtCond       1423 non-null   object 
 31  BsmtExposure   1422 non-null   object 
 32  BsmtFinType1   1423 non-null   object 
 33  BsmtFinSF1     1460 non-null   int64  
 34  BsmtFinType2   1422 non-null   object 
 35  BsmtFinSF2     1460 non-null   int64  
 36  BsmtUnfSF      1460 non-null   int64  
 37  TotalBsmtSF    1460 non-null   int64  
 38  Heating        1460 non-null   object 
 39  HeatingQC      1460 non-null   object 
 40  CentralAir     1460 non-null   object 
 41  Electrical     1459 non-null   object 
 42  1stFlrSF       1460 non-null   int64  
 43  2ndFlrSF       1460 non-null   int64  
 44  LowQualFinSF   1460 non-null   int64  
 45  GrLivArea      1460 non-null   int64  
 46  BsmtFullBath   1460 non-null   int64  
 47  BsmtHalfBath   1460 non-null   int64  
 48  FullBath       1460 non-null   int64  
 49  HalfBath       1460 non-null   int64  
 50  BedroomAbvGr   1460 non-null   int64  
 51  KitchenAbvGr   1460 non-null   int64  
 52  KitchenQual    1460 non-null   object 
 53  TotRmsAbvGrd   1460 non-null   int64  
 54  Functional     1460 non-null   object 
 55  Fireplaces     1460 non-null   int64  
 56  FireplaceQu    770 non-null    object 
 57  GarageType     1379 non-null   object 
 58  GarageYrBlt    1379 non-null   float64
 59  GarageFinish   1379 non-null   object 
 60  GarageCars     1460 non-null   int64  
 61  GarageArea     1460 non-null   int64  
 62  GarageQual     1379 non-null   object 
 63  GarageCond     1379 non-null   object 
 64  PavedDrive     1460 non-null   object 
 65  WoodDeckSF     1460 non-null   int64  
 66  OpenPorchSF    1460 non-null   int64  
 67  EnclosedPorch  1460 non-null   int64  
 68  3SsnPorch      1460 non-null   int64  
 69  ScreenPorch    1460 non-null   int64  
 70  PoolArea       1460 non-null   int64  
 71  PoolQC         7 non-null      object 
 72  Fence          281 non-null    object 
 73  MiscFeature    54 non-null     object 
 74  MiscVal        1460 non-null   int64  
 75  MoSold         1460 non-null   int64  
 76  YrSold         1460 non-null   int64  
 77  SaleType       1460 non-null   object 
 78  SaleCondition  1460 non-null   object 
 79  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(34), object(43)
memory usage: 923.9+ KB

We can see there are a lot of numerical and categorical features.
Let's first take a look at the numerical (particularly continuous) features to spot outliers.
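As a quick illustration (a sketch; the curated FEATURES dictionary built later in this notebook is what the pipeline actually uses), pandas' select_dtypes can split the columns by type:

# Sketch: split column names by dtype (illustrative only)
num_cols = housing.select_dtypes(include=['int64','float64']).columns.tolist()
cat_cols = housing.select_dtypes(include='object').columns.tolist()
print(len(num_cols),'numerical and',len(cat_cols),'categorical features')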

# Create a list of continuous features to check their distributions and spot potential outliers
cont_feats = ['SalePrice','LotFrontage',
              'LotArea','MasVnrArea',
              'BsmtFinSF1','BsmtFinSF2',
              'BsmtUnfSF','TotalBsmtSF',
              '1stFlrSF','2ndFlrSF',
              'LowQualFinSF','GrLivArea',
              'GarageArea','WoodDeckSF',
              'OpenPorchSF','EnclosedPorch',
              'ScreenPorch']

1.1. Distributions of Continuous Features

_=housing[cont_feats].hist(figsize = (15,20),bins=25)
_images/00-data-wrangling_9_0.png

1.2. Distribution of SalePrice (target variable)

from scipy import stats
sns.distplot(housing.SalePrice, kde=False, bins=30)
plt.text(x=5e5, y=100,s='Skew = %.3f' %(housing.SalePrice.skew()),fontdict={'fontsize':14})
_=plt.xticks(rotation = 45)
_images/00-data-wrangling_12_0.png

The distribution has a positive skew. Let's apply a log transform.

housing['log1p(SalePrice)'] = np.log1p(housing['SalePrice'])
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(housing.SalePrice, kde=False, bins=30)
plt.text(x=5e5, y=100,s='Skew = %.3f' %(housing.SalePrice.skew()),fontdict={'fontsize':14})
_=plt.xticks(rotation = 45)

plt.subplot(1,2,2)
sns.distplot(housing['log1p(SalePrice)'], kde=False, bins=30)
_=plt.text(x=12.5, y=150,s='Skew = %.3f' %(housing['log1p(SalePrice)'].skew()),fontdict={'fontsize':14})
_images/00-data-wrangling_15_0.png

Looks close to normal after transforming.
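As an optional sanity check (a sketch, not part of the original analysis), scipy's normaltest quantifies the departure from normality; note scipy.stats was already imported above:

# Sketch: D'Agostino-Pearson normality test on the transformed target.
# A large p-value means we cannot reject normality.
stat, pval = stats.normaltest(housing['log1p(SalePrice)'])
print('statistic = %.2f, p-value = %.4f' % (stat, pval))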

1.3. Outlier Identification

1.3.1. Distribution of GrLivArea

sns.distplot(housing.GrLivArea, kde=False, bins=30)
plt.text(x=3000, y=150,s='Skew = %.3f' %(housing.GrLivArea.skew()),fontdict={'fontsize':14})
_=plt.xticks(rotation = 45)
_images/00-data-wrangling_19_0.png

The distribution has a positive skew.

Let's also apply a log transform to GrLivArea.

# Initialize a transformer list object to keep track of all data preprocessing steps
transformers = []
def log1p(X,columns):
    X_new = X.copy()
    for col in columns:
        X_new['log1p({})'.format(col)] = np.log1p(X_new[col])
        #X_new.drop(col,axis=1,inplace=True)
    return X_new

from sklearn.preprocessing import FunctionTransformer
log_trans = FunctionTransformer(log1p,validate=False,
                                      kw_args=dict(columns=['GrLivArea']))
transformers.append(('log_trans',log_trans))
housing['log1p(GrLivArea)'] = np.log1p(housing.GrLivArea)
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(housing['log1p(GrLivArea)'],kde=False,bins=int(np.sqrt(len(housing))))
plt.text(x=6, y=120,s='Skew = %.3f' %(housing['log1p(GrLivArea)'].skew()),fontdict={'fontsize':14})
plt.subplot(1,2,2)
plt.scatter(housing['log1p(GrLivArea)'], housing['log1p(SalePrice)'], c='blue',marker = 's',alpha=0.25)
plt.xlabel('log1p(GrLivArea)')
plt.ylabel('log1p(SalePrice)')
Text(0, 0.5, 'log1p(SalePrice)')
_images/00-data-wrangling_24_1.png

No apparent outliers in the log of GrLivArea. The distribution also looks normal.

1.3.2. Distribution of GarageArea

sns.distplot(housing.GarageArea, kde=False, bins=30)
plt.text(x=1000, y=150,s='Skew = %.3f' %(housing.GarageArea.skew()),fontdict={'fontsize':14})
_=plt.xticks(rotation = 45)
_images/00-data-wrangling_27_0.png

The distribution looks reasonably normal.

plt.scatter(housing.GarageArea, 
            housing.SalePrice, c='blue',marker = 's',alpha=0.25)
plt.xlabel('GarageArea')
plt.ylabel('SalePrice')
Text(0, 0.5, 'SalePrice')
_images/00-data-wrangling_29_1.png

No apparent outliers in the scatter plot.

1.3.3. Distribution of LotArea

sns.distplot(housing.LotArea, kde=False, bins=50)
plt.text(x=50000, y=150,s='Skew = %.3f' %(housing.LotArea.skew()),fontdict={'fontsize':14})
_=plt.xticks(rotation = 45)
_images/00-data-wrangling_32_0.png

The distribution is highly skewed due to outliers.

plt.scatter(housing.LotArea, 
            housing['log1p(SalePrice)'], c='blue',marker = 's',alpha=0.25)
plt.axvline(x=60000,color='red',linestyle='--')
plt.xlabel('LotArea')
plt.ylabel('log1p(SalePrice)')
Text(0, 0.5, 'log1p(SalePrice)')
_images/00-data-wrangling_34_1.png

Houses with LotArea > 60000 sq.ft. seem like outliers. Let's remove them.

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.scatter(housing.LotArea, 
            housing['log1p(SalePrice)'], 
            c='blue',marker = 's',alpha=0.25)
plt.axvline(x=60000,color='red',linestyle='--')
plt.xlabel('LotArea')
plt.ylabel('log1p(SalePrice)')

outlier = pd.DataFrame()
outlier['LotArea'] = housing.LotArea > 60000

plt.subplot(1,2,2)
plt.scatter(housing[~outlier.any(axis=1)].LotArea, housing[~outlier.any(axis=1)]['log1p(SalePrice)'], c='blue',marker = 's',alpha=0.25)
plt.xlabel('LotArea')
plt.ylabel('')
plt.title('After removing Lot Area > %.0f sq.ft.' %(60000))
Text(0.5, 1.0, 'After removing Lot Area > 60000 sq.ft.')
_images/00-data-wrangling_36_1.png
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(housing.LotArea, kde=False, bins=50)
plt.text(x=50000, y=150,s='Skew = %.3f' %(housing.LotArea.skew()),fontdict={'fontsize':14})
_=plt.xticks(rotation = 45)
_=plt.title('Before outlier removal')

plt.subplot(1,2,2)
sns.distplot(housing[~outlier.any(axis=1)]['LotArea'], kde=False, bins=50)
plt.text(x=15000, y=100,s='Skew = %.3f' %(housing[~outlier.any(axis=1)]['LotArea'].skew()),fontdict={'fontsize':14})
_=plt.title('After outlier removal')
_=plt.xticks(rotation = 45)
_images/00-data-wrangling_37_0.png

Skewness improved considerably after removing the outliers. The distribution now looks reasonably normal.

print('Number of examples before outlier removal: ',housing.shape[0])
print('Number of examples after outlier removal: ',housing[~outlier.any(axis=1)].shape[0])
print('%i examples are excluded' %(housing.shape[0]-housing[~outlier.any(axis=1)].shape[0]))
Number of examples before outlier removal:  1460
Number of examples after outlier removal:  1454
6 examples are excluded
def remove_outliers(X,columns,thresholds):
    '''Drop rows whose value in any of the given columns exceeds its threshold.'''
    outlier = pd.DataFrame()
    for column,threshold in zip(columns,thresholds):
        if column in ('GrLivArea','LotArea'):
            outlier[column] = X[column] > threshold
        else:
            print(column+' not processed for outliers')
    return X[~outlier.any(axis=1)]

outlier_remover = FunctionTransformer(remove_outliers,validate=False,
                                      kw_args=dict(columns=['LotArea'],
                                                   thresholds=[60000]))
outlier_remover.fit(housing) # Fit on training data

transformers.append(('outlier_remover',outlier_remover))

housing = outlier_remover.transform(housing) # Use the transformer instead of the manual filter below
#housing = housing[~outlier.any(axis=1)]
print('Dataset size:',housing.shape)
Dataset size: (1454, 82)

For more advanced methods of dealing with outliers, refer to:
1. https://pyod.readthedocs.io/en/latest/
2. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest
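As an illustration of the second reference, here is a minimal sketch using scikit-learn's IsolationForest (not part of this notebook's pipeline):

# Sketch: multivariate outlier detection with IsolationForest (illustrative only).
# fit_predict returns +1 for inliers and -1 for outliers; contamination is the
# assumed fraction of outliers in the data.
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(housing[cont_feats].fillna(0))
print('Flagged %i potential outliers' % (labels == -1).sum())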

In the next step we will treat missing values.

1.4. Missing Value Treatment

1.4.1. Categorical Features

Let's first treat missing values in the categorical features.

MSSubClass is a categorical feature, but it was loaded as a numerical one, so let's convert it to a string (categorical) type.

def to_str(X,columns):
    for column in columns:
        X[column] = X[column].astype(str)
    return X

to_string = FunctionTransformer(to_str,validate=False,
                                      kw_args=dict(columns=["MSSubClass"]))
to_string.fit(housing) # fit on training data
transformers.append(('MSSubClass_to_str',to_string))
housing = to_string.transform(housing)
from collections import defaultdict

# Initialize a feature dictionary which will keep track of all features
FEATURES = defaultdict(list)
missing_vals_cat = pd.DataFrame(columns=['Feature', '# of missing vals','% of missing vals'])
for feature, numNan in housing.isnull().sum().iteritems():
    if housing[feature].dtypes == 'object':
        FEATURES['cat'].append(feature)
        if numNan != 0:
            missing_vals_cat = pd.concat([missing_vals_cat,pd.DataFrame([feature,numNan,(numNan/housing.shape[0])*100],
                                                                        index=['Feature', '# of missing vals','% of missing vals']).T])
            
missing_vals_cat.sort_values(by=['% of missing vals'],ascending=False).reset_index(drop=True)
Feature # of missing vals % of missing vals
0 PoolQC 1448 99.5873
1 MiscFeature 1402 96.4237
2 Alley 1363 93.7414
3 Fence 1173 80.674
4 FireplaceQu 690 47.4553
5 GarageType 81 5.57084
6 GarageFinish 81 5.57084
7 GarageQual 81 5.57084
8 GarageCond 81 5.57084
9 BsmtExposure 38 2.61348
10 BsmtFinType2 38 2.61348
11 BsmtQual 37 2.5447
12 BsmtCond 37 2.5447
13 BsmtFinType1 37 2.5447
14 MasVnrType 8 0.550206
15 Electrical 1 0.0687758

There are two categorical features, Electrical and MasVnrType, whose missing values are genuinely unknown. Both have insignificant amounts of missing values (0.07% and 0.55%, respectively), hence it is reasonable to impute them with the most frequently occurring category.

All other features have “meaningful” missing values. For example, if a house does not have a basement, all basement-related features will be null. Hence, we replace those missing values with a separate category called “Missing”.

1.4.1.1. Electrical & MasVnrType

fig, (ax1,ax2) = plt.subplots(ncols=2,figsize=(10,4))
fig.subplots_adjust(wspace=0.25)
sns.countplot(x='Electrical',data=housing,ax=ax1)
ax1.set_title('Electrical')
sns.countplot(x='MasVnrType',data=housing,ax=ax2)
ax2.set_title('MasVnrType')
Text(0.5, 1.0, 'MasVnrType')
_images/00-data-wrangling_52_1.png

SBrkr is the most frequent category in Electrical.

None is the most frequent category in MasVnrType.

Impute the missing categories with the most frequently occurring ones.

cat_mode_mapper = housing[FEATURES['cat']].mode().iloc[0].to_dict()

def impute_cat(X,feat_to_mode=['Electrical','MasVnrType'], mapper = cat_mode_mapper):
    """
    Imputes missing values in Electrical and MasVnrType with modes, 
    and replaces NaN values in all other categorical features with 'Missing'
    """
    for col, mode in mapper.items():
        if col in feat_to_mode:
            X[col].fillna(mode,inplace=True)
        else:
            X[col].fillna('Missing',inplace=True)
    return X

cat_Imputer = FunctionTransformer(impute_cat,validate=False,
                                      kw_args=dict(feat_to_mode=['Electrical','MasVnrType'],
                                                   mapper=cat_mode_mapper))
cat_Imputer.fit(housing)
transformers.append(('cat_Imputer',cat_Imputer))
housing = cat_Imputer.transform(housing)
# Check whether any missing values remain in the categorical features
housing[FEATURES['cat']].isnull().sum()
MSSubClass       0
MSZoning         0
Street           0
Alley            0
LotShape         0
LandContour      0
Utilities        0
LotConfig        0
LandSlope        0
Neighborhood     0
Condition1       0
Condition2       0
BldgType         0
HouseStyle       0
RoofStyle        0
RoofMatl         0
Exterior1st      0
Exterior2nd      0
MasVnrType       0
ExterQual        0
ExterCond        0
Foundation       0
BsmtQual         0
BsmtCond         0
BsmtExposure     0
BsmtFinType1     0
BsmtFinType2     0
Heating          0
HeatingQC        0
CentralAir       0
Electrical       0
KitchenQual      0
Functional       0
FireplaceQu      0
GarageType       0
GarageFinish     0
GarageQual       0
GarageCond       0
PavedDrive       0
PoolQC           0
Fence            0
MiscFeature      0
SaleType         0
SaleCondition    0
dtype: int64

1.4.2. Numerical Features

missing_vals_num = pd.DataFrame(columns=['Feature','# of missing values','% of missing values'])
for feature, numNan in housing.isnull().sum().iteritems():
    if housing[feature].dtypes!= 'object' and 'SalePrice' not in feature:
        FEATURES['num'].append(feature)
        if numNan != 0:
            missing_vals_num = pd.concat([missing_vals_num,
                                          pd.DataFrame([feature,numNan,
                                                        (numNan/housing.shape[0])*100],
                                                       index=['Feature','# of missing values','% of missing values']).T])

missing_vals_num.reset_index(drop = True).sort_values(by='# of missing values',ascending=False)
Feature # of missing values % of missing values
0 LotFrontage 256 17.6066
2 GarageYrBlt 81 5.57084
1 MasVnrArea 8 0.550206

1.4.2.1. MasVnrArea

MasVnrArea: Since the most common category (None) was imputed for the missing values in the categorical feature MasVnrType, it makes sense to impute missing values in MasVnrArea with the median MasVnrArea of houses that have “None” as their MasVnrType.

MedianMasVnrArea = housing[housing['MasVnrType'] == 'None']['MasVnrArea'].median()
print("Median MasVnrArea: ",MedianMasVnrArea)
Median MasVnrArea:  0.0

1.4.2.2. GarageYrBlt

GarageYrBlt: these are the same rows that have missing values in the other garage-related features, indicating that these houses do not have a garage. Imputing the median value is a safe option, as these houses have already been flagged as having no garage in the categorical features.
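A quick check (a sketch; it relies on the categorical imputation above having marked no-garage houses as “Missing”) confirms that claim:

# Sketch: the rows missing GarageYrBlt should be exactly the houses whose
# garage features were imputed to 'Missing' above (i.e., houses with no garage)
no_garage = housing.loc[housing['GarageYrBlt'].isnull(),'GarageType']
print((no_garage == 'Missing').all())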

MedianGarageYrBlt = housing['GarageYrBlt'].median()
print("Median GarageYrBlt: ",MedianGarageYrBlt)
Median GarageYrBlt:  1980.0

1.4.2.3. LotFrontage

LotFrontage: missing values need to be handled more carefully, as they amount to 17.6% of the total observations.

  • Let’s take a look at the other three features associated with Lot: LotConfig, LotShape, and LotArea

plt.figure(figsize = (15,5))
plt.subplots_adjust(wspace=0.25)
plt.subplot(1,2,1)
sns.boxplot(x='LotConfig',y='LotFrontage',data=housing,hue='LotShape')
plt.subplot(1,2,2)
sns.scatterplot(x='LotArea',y='LotFrontage',data=housing,marker='s',alpha=0.25)
<matplotlib.axes._subplots.AxesSubplot at 0x7f475d537f60>
_images/00-data-wrangling_63_1.png
  • Group the data by LotShape and LotConfig and compute group-specific median LotFrontage.

  • Impute missing LotFrontage with the group-specific median values (a concise pandas alternative is sketched below, followed by the pipeline version).
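For comparison only (not executed as part of the pipeline), the same group-wise median imputation can be written as a single pandas expression. Unlike the mapper-based transformer below, it recomputes medians on whatever data it receives, so it cannot reuse training-set medians on the test set:

# Sketch (comparison only): group-wise median imputation in one expression
lf_filled = (housing.groupby(['LotConfig','LotShape'])['LotFrontage']
                    .transform(lambda s: s.fillna(s.median())))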

num_median_mapper = housing[FEATURES['num']].median().to_dict()
num_median_mapper['LotFrontage'] = housing.groupby(['LotConfig','LotShape'])['LotFrontage'].median().to_dict()

def impute_num(X,mapper=num_median_mapper):
    for feat,vals in mapper.items():
        if feat == 'LotFrontage':
            for key,val in vals.items():
                LotConfig,LotShape = key
                X.loc[(X['LotFrontage'].isnull()) & (X['LotConfig']==LotConfig) & (X['LotShape']==LotShape),['LotFrontage']] = val
            X.loc[X['LotFrontage'].isnull(),'LotFrontage'] = X['LotFrontage'].median()
        else:
            X[feat].fillna(vals,inplace=True)
    return X

num_imputer = FunctionTransformer(impute_num,validate=False,
                                      kw_args=dict(mapper=num_median_mapper))
num_imputer.fit(housing)

transformers.append(('num_imputer',num_imputer))

housing = num_imputer.transform(housing)
# Check whether any missing values remain in the entire dataset
housing.isnull().sum().sum()
0

Data cleaning is done at this point.

Next, we will add more data to the dataset from other sources.

1.5. Data Augmentation

In this part, we will add more features to the dataset by gathering data from other sources like Google’s Geocoding API and the uszipcode library.

# Number of unique categories per categorical feature
unique_cats = defaultdict(int)

for col in housing.columns:
    if housing[col].dtypes == 'object':
        unique_cats[col] = housing[col].nunique()
        #print('{0}:\t\t {1} unique categories'.format(col,housing_cat[col].nunique()))

num_unique_cats = pd.Series(unique_cats)
num_unique_cats.sort_values(ascending = False)
Neighborhood     25
Exterior2nd      16
MSSubClass       15
Exterior1st      15
Condition1        9
SaleType          9
Condition2        8
HouseStyle        8
RoofMatl          7
GarageType        7
Functional        7
BsmtFinType1      7
BsmtFinType2      7
Heating           6
RoofStyle         6
SaleCondition     6
Foundation        6
FireplaceQu       6
GarageCond        6
GarageQual        6
MSZoning          5
LotConfig         5
MiscFeature       5
Fence             5
BldgType          5
HeatingQC         5
ExterCond         5
Electrical        5
BsmtQual          5
BsmtCond          5
BsmtExposure      5
MasVnrType        4
LotShape          4
ExterQual         4
KitchenQual       4
PoolQC            4
GarageFinish      4
LandContour       4
Alley             3
LandSlope         3
PavedDrive        3
Utilities         2
Street            2
CentralAir        2
dtype: int64

Neighborhood has the most unique categories. The categories are abbreviated location names.

1.5.1. Google API: Geocoding

In this section, we will extract latitude and longitude coordinates, as well as zipcodes, from Google’s Geocoding API.

import googlemaps
import json
# Import the googlemaps API key
with open('data/keys.json') as file:
    keys = json.load(file)
    
gmaps = googlemaps.Client(key=keys['geocoding'])
# List of neighborhoods
housing.Neighborhood.unique()
array(['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst',
       'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes',
       'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert',
       'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU',
       'Blueste'], dtype=object)
neighborhoods = {"CollgCr":"College Creek","Veenker":"Veenker",
                 "Crawfor":"Crawford","NoRidge":"Northridge",
                 "Mitchel":"Mitchell","Somerst":"Somerset",
                 "NWAmes":"Northwest Ames","OldTown":"Old Town",
                 "BrkSide":"Brookside","Sawyer":"Sawyer",
                 "NridgHt":"Northridge Heights","NAmes":"North Ames",
                 "SawyerW":"Sawyer West","IDOTRR":"Iowa DOT and Rail Road",
                 "MeadowV":"Meadow Village","Edwards":"Edwards",
                 "Timber":"Timberland","Gilbert":"Gilbert",
                 "StoneBr":"Stone Brook","ClearCr":"Clear Creek",
                 "NPkVill":"Northpark Villa","Blmngtn":"Bloomington Heights",
                 "BrDale":"Briardale","SWISU":"South & West of Iowa State University",
                 "Blueste":"Bluestem"}

Let’s create a separate dataframe with unique Neighborhood categories

geo_df = pd.DataFrame(housing.Neighborhood.unique(),columns=['Neighborhood'])
geo_df.head()
Neighborhood
0 CollgCr
1 Veenker
2 Crawfor
3 NoRidge
4 Mitchel
import re
def getGeoInfo(Neighborhood, output):
    '''
    Takes the Neighborhood and the kind of output you want ('lat', 'lng', or 'zipcode')
    '''
    def getZipCode(formatted_address):
        substr = re.findall('IA \d{5}, USA',formatted_address)
        if len(substr) == 0:
            return 'Missing'
        else:
            return re.findall('\d{5}',substr[0])[0]

    geocode_result = gmaps.geocode(neighborhoods[Neighborhood]+', Ames, IA')

    # Pick the first geocoding result whose formatted address is in Iowa
    i = 0
    while True:
        if i == len(geocode_result):
            idx = None
            break
        elif 'IA' in geocode_result[i]['formatted_address']:
            idx = i
            break
        i += 1

    if idx is None:  # no Iowa result found; report the value as missing
        return 'Missing'

    lat = geocode_result[idx]['geometry']['location']['lat']
    lng = geocode_result[idx]['geometry']['location']['lng']
    zipcode = getZipCode(geocode_result[idx]['formatted_address'])

    if output == 'lat':
        return lat
    elif output == 'lng':
        return lng
    elif output == 'zipcode':
        return zipcode
    else:
        return 'invalid output!'
geo_df['Lat'] = geo_df['Neighborhood'].apply(getGeoInfo,output='lat')
geo_df['Lng'] = geo_df['Neighborhood'].apply(getGeoInfo,output='lng')
geo_df['zipcode'] = geo_df['Neighborhood'].apply(getGeoInfo,output='zipcode')
geo_df
Neighborhood Lat Lng zipcode
0 CollgCr 42.022197 -93.651510 Missing
1 Veenker 42.041304 -93.650302 50011
2 Crawfor 42.018614 -93.648898 50014
3 NoRidge 42.047831 -93.646745 50010
4 Mitchel 41.990308 -93.601053 50010
5 Somerst 42.052628 -93.644582 50010
6 NWAmes 42.038277 -93.625770 50010
7 OldTown 42.029046 -93.614340 50010
8 BrkSide 42.028653 -93.630386 50010
9 Sawyer 42.033903 -93.677066 50014
10 NridgHt 42.059773 -93.649471 50010
11 NAmes 42.049358 -93.622442 50010
12 SawyerW 42.024906 -93.659599 50014
13 IDOTRR 42.021997 -93.622018 50010
14 MeadowV 41.992448 -93.602645 50010
15 Edwards 42.015510 -93.685391 50014
16 Timber 41.999973 -93.649693 50014
17 Gilbert 42.106929 -93.649663 50105
18 StoneBr 42.059727 -93.637642 50010
19 ClearCr 42.036097 -93.648830 50011
20 NPkVill 42.030781 -93.631913 Missing
21 Blmngtn 42.056376 -93.644471 50010
22 BrDale 42.052795 -93.628821 50010
23 SWISU 42.026657 -93.646452 50011
24 Blueste 42.023090 -93.663783 50014

In geo_df, there are two neighborhoods (CollgCr, NPkVill) for which Google’s geocoding API was not able to provide a zipcode.

Let's look at a Lat vs. Lng scatter plot to see which neighborhoods are close to those with missing zipcodes.

sns.scatterplot(x='Lat',y='Lng',
                data=geo_df,
                hue='zipcode',
                palette=['red','cyan','magenta','green','blue'],
                alpha=1,s=100)
plt.legend(loc='center right',bbox_to_anchor = (1.35,0.5))
_=plt.xticks(rotation=45)
_images/00-data-wrangling_81_0.png

It would be reasonable to impute the missing zipcodes from the closest neighbors, but there is another library that can provide zipcode info based on geographic coordinates.

It is the uszipcode library: https://pypi.org/project/uszipcode/

This library can also be used to extract demographic information for different locations in the US.

1.5.2. uszipcode

In this section, we will impute the missing zipcodes and augment our dataset with more features like population density, total population, median household income, total number of housing units, etc.
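A minimal usage sketch (using the same SearchEngine calls as the helper below):

# Sketch: look up a zipcode and pull a couple of demographic fields
from uszipcode import SearchEngine

search = SearchEngine(simple_zipcode=True)  # simple_zipcode uses a smaller, faster database
info = search.by_zipcode('50010').to_dict()
print(info['median_household_income'], info['population_density'])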

from uszipcode import SearchEngine
def getMoreFeat(a,b):
    '''
    Takes a zipcode or [lat,lng] as input (a) and returns the requested
    information about that location. The information requested should be
    specified as the second argument (b).

    Acceptable requests are: 'lat', 'lng',
                             'population', 'zipcode',
                             'population_density', 'housing_units',
                             'occupied_housing_units', 'median_home_value',
                             'median_household_income'
    '''
    def getZipCode(a):
        geo_info = search.by_coordinates(a[0],a[1],radius=5)
        i = 0
        while True:
            if i == len(geo_info):
                idx = None
                break
            elif (geo_info[i].post_office_city == 'Ames, IA') and (geo_info[i].major_city == 'Ames') and (geo_info[i].common_city_list[0] == 'Ames'):
                idx = i
                break
            i += 1
            
        if idx is None:
            print("Coordinates don't belong to Ames, IA. NaN returned")
            return np.nan
        else:
            return geo_info[idx].to_dict()[b]
    
    def getRequest(zipcode,b):
        geo_info = search.by_zipcode(zipcode)
        return geo_info.to_dict()[b]
        
        
    search = SearchEngine(simple_zipcode=True)
    
    if b == 'zipcode': 
        return getZipCode(a)
    elif b in ['lat', 'lng', 'population', 'zipcode', 
                  'population_density','housing_units',
                  'occupied_housing_units', 'median_home_value',
                  'median_household_income']:
        return getRequest(a,b)
        
geo_df.loc[geo_df.zipcode=='Missing','zipcode'] = geo_df.loc[geo_df.zipcode=='Missing',['Lat','Lng']].apply(getMoreFeat,axis=1,b='zipcode')
geo_df['median_household_income'] = geo_df['zipcode'].apply(getMoreFeat,b='median_household_income')
geo_df['median_home_value'] = geo_df['zipcode'].apply(getMoreFeat,b='median_home_value')
geo_df
Neighborhood Lat Lng zipcode median_household_income median_home_value
0 CollgCr 42.022197 -93.651510 50010 48189.0 165300.0
1 Veenker 42.041304 -93.650302 50011 NaN NaN
2 Crawfor 42.018614 -93.648898 50014 37661.0 212500.0
3 NoRidge 42.047831 -93.646745 50010 48189.0 165300.0
4 Mitchel 41.990308 -93.601053 50010 48189.0 165300.0
5 Somerst 42.052628 -93.644582 50010 48189.0 165300.0
6 NWAmes 42.038277 -93.625770 50010 48189.0 165300.0
7 OldTown 42.029046 -93.614340 50010 48189.0 165300.0
8 BrkSide 42.028653 -93.630386 50010 48189.0 165300.0
9 Sawyer 42.033903 -93.677066 50014 37661.0 212500.0
10 NridgHt 42.059773 -93.649471 50010 48189.0 165300.0
11 NAmes 42.049358 -93.622442 50010 48189.0 165300.0
12 SawyerW 42.024906 -93.659599 50014 37661.0 212500.0
13 IDOTRR 42.021997 -93.622018 50010 48189.0 165300.0
14 MeadowV 41.992448 -93.602645 50010 48189.0 165300.0
15 Edwards 42.015510 -93.685391 50014 37661.0 212500.0
16 Timber 41.999973 -93.649693 50014 37661.0 212500.0
17 Gilbert 42.106929 -93.649663 50105 80703.0 165400.0
18 StoneBr 42.059727 -93.637642 50010 48189.0 165300.0
19 ClearCr 42.036097 -93.648830 50011 NaN NaN
20 NPkVill 42.030781 -93.631913 50010 48189.0 165300.0
21 Blmngtn 42.056376 -93.644471 50010 48189.0 165300.0
22 BrDale 42.052795 -93.628821 50010 48189.0 165300.0
23 SWISU 42.026657 -93.646452 50011 NaN NaN
24 Blueste 42.023090 -93.663783 50014 37661.0 212500.0

There are missing values in the median_household_income and median_home_value columns.

Let's impute the missing values with the most frequently occurring ones.

geo_df['zipcode'] = geo_df.zipcode.astype(int)
geo_df.set_index('Neighborhood',inplace=True)

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
geo_df = pd.DataFrame(imputer.fit_transform(geo_df),columns=geo_df.columns,index=geo_df.index)

geo_df.reset_index(inplace=True)
geo_df
Neighborhood Lat Lng zipcode median_household_income median_home_value
0 CollgCr 42.022197 -93.651510 50010.0 48189.0 165300.0
1 Veenker 42.041304 -93.650302 50011.0 48189.0 165300.0
2 Crawfor 42.018614 -93.648898 50014.0 37661.0 212500.0
3 NoRidge 42.047831 -93.646745 50010.0 48189.0 165300.0
4 Mitchel 41.990308 -93.601053 50010.0 48189.0 165300.0
5 Somerst 42.052628 -93.644582 50010.0 48189.0 165300.0
6 NWAmes 42.038277 -93.625770 50010.0 48189.0 165300.0
7 OldTown 42.029046 -93.614340 50010.0 48189.0 165300.0
8 BrkSide 42.028653 -93.630386 50010.0 48189.0 165300.0
9 Sawyer 42.033903 -93.677066 50014.0 37661.0 212500.0
10 NridgHt 42.059773 -93.649471 50010.0 48189.0 165300.0
11 NAmes 42.049358 -93.622442 50010.0 48189.0 165300.0
12 SawyerW 42.024906 -93.659599 50014.0 37661.0 212500.0
13 IDOTRR 42.021997 -93.622018 50010.0 48189.0 165300.0
14 MeadowV 41.992448 -93.602645 50010.0 48189.0 165300.0
15 Edwards 42.015510 -93.685391 50014.0 37661.0 212500.0
16 Timber 41.999973 -93.649693 50014.0 37661.0 212500.0
17 Gilbert 42.106929 -93.649663 50105.0 80703.0 165400.0
18 StoneBr 42.059727 -93.637642 50010.0 48189.0 165300.0
19 ClearCr 42.036097 -93.648830 50011.0 48189.0 165300.0
20 NPkVill 42.030781 -93.631913 50010.0 48189.0 165300.0
21 Blmngtn 42.056376 -93.644471 50010.0 48189.0 165300.0
22 BrDale 42.052795 -93.628821 50010.0 48189.0 165300.0
23 SWISU 42.026657 -93.646452 50011.0 48189.0 165300.0
24 Blueste 42.023090 -93.663783 50014.0 37661.0 212500.0
geo_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Neighborhood             25 non-null     object 
 1   Lat                      25 non-null     float64
 2   Lng                      25 non-null     float64
 3   zipcode                  25 non-null     float64
 4   median_household_income  25 non-null     float64
 5   median_home_value        25 non-null     float64
dtypes: float64(5), object(1)
memory usage: 1.3+ KB
# Add the new features to the FEATURES dictionary
for col in geo_df:
    if col == 'Neighborhood':
        print(col+' already in FEATURES["cat"]')
    elif geo_df[col].dtypes == 'object':
        FEATURES['aug_cat'].append(col)
    else:
        FEATURES['aug_num'].append(col)
Neighborhood already in FEATURES["cat"]

Now that geo_df, which includes geographic and demographic information, is ready, let's merge it with the main dataset.

def augment(X,geo_df):
    X.reset_index(inplace=True)
    X = X.merge(geo_df,how='left')
    X.set_index('Id',inplace=True)
    return X

augmenter = FunctionTransformer(augment,validate=False,kw_args=dict(geo_df=geo_df))
transformers.append(('augmenter',augmenter))

housing = augmenter.fit_transform(housing)
print(housing.shape)
(1454, 87)
test = pd.read_csv('data/test.csv',index_col='Id')
my_pipeline = Pipeline(transformers)
my_pipeline.fit(housing_raw)
housing_temp = my_pipeline.transform(housing_raw)

print('Manually and pipeline created datasets are equal:')
print(housing.drop('log1p(SalePrice)',axis=1).equals(housing_temp))

test_temp = my_pipeline.transform(test)
Manually and pipeline created datasets are equal:
True
# Check that every column is tracked in the FEATURES dictionary (only the target columns should be left out)
for feature in housing.columns:
    if feature not in FEATURES['cat']+FEATURES['num']+FEATURES['aug_num']:
        print(feature+' is not tracked in FEATURES')
SalePrice is not tracked in FEATURES
log1p(SalePrice) is not tracked in FEATURES

Correlations of the newly added geographic and demographic features with the target variable.

plt.figure(figsize=(6,4))
sns.heatmap(housing[['log1p(SalePrice)','SalePrice','Lat','Lng','zipcode','median_household_income','median_home_value']].corr(),
           vmin=-1,vmax=1,cmap='coolwarm',annot=True,annot_kws=dict(size=12),fmt='.2f')
plt.title('Correlation Matrix')
Text(0.5, 1.0, 'Correlation Matrix')
_images/00-data-wrangling_99_1.png

Correlations are high, but let's keep these features for now. We will come back to this later.

print('Dimensions of the dataset',housing[FEATURES['num']+FEATURES['cat']+FEATURES['aug_num']].shape)
Dimensions of the dataset (1454, 85)

At this point, the data is free of outliers and missing values.

with open('data/wrangled_data.pkl','wb') as file:
    pickle.dump((housing,FEATURES,transformers),file)
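For reference, a downstream notebook can reload this artifact as follows (a minimal sketch):

# Sketch: reload the wrangled data, feature dictionary, and transformer list
with open('data/wrangled_data.pkl','rb') as file:
    housing, FEATURES, transformers = pickle.load(file)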