Quantifying the World

Case Study - Data Imputation - Boston Housing Data

Stacy Conant

March 16, 2020

Introduction

When working with real-world datasets, it is often the case that the data will require a good deal of cleaning and preparation to be useful. This is especially true when portions of the data are missing. An analyst should examine missing data to try to understand why it is missing, and then carefully decide how to handle the missing values: should the records that have missing data be removed, or should the values be imputed? If imputed, what method should be used? How these issues are handled can greatly affect the final data analysis.

For this case study, three different patterns of missing data are created using the Boston Housing dataset from sklearn: a missing completely at random (MCAR) model, a missing at random (MAR) model, and a missing not at random (MNAR) model. A baseline regression model was fit on the full dataset so that the effects of imputing the missing data can be compared against it. Comparisons are made using Root Mean Square Error (RMSE) and $R^2$. One might expect that the more data is imputed, the more error is introduced into the regression and the more the fit is adversely affected.
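
As a brief illustration (a toy sketch, not part of the original notebook), the three mechanisms differ in what drives the missingness: under MCAR the chance that a value is missing is unrelated to any data; under MAR it depends only on other observed variables; under MNAR it depends on the value that goes missing itself. With hypothetical columns x and y:

import numpy as np
import pandas as pd

np.random.seed(0)
toy = pd.DataFrame({'x': np.random.normal(50, 10, 1000),
                    'y': np.random.normal(0, 1, 1000)})

# MCAR: whether y is missing is unrelated to any variable, observed or not
mcar = toy.copy()
mcar.loc[np.random.rand(len(toy)) < 0.2, 'y'] = np.nan

# MAR: whether y is missing depends only on the observed variable x
mar = toy.copy()
mar.loc[(toy['x'] > 60) & (np.random.rand(len(toy)) < 0.5), 'y'] = np.nan

# MNAR: whether y is missing depends on the value of y itself
mnar = toy.copy()
mnar.loc[toy['y'] > 1.0, 'y'] = np.nan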

Data

The Boston Housing dataset is available from Sklearn. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Each record in the database describes a Boston suburb or town. There are 506 records and 14 attributes. The target variable is the median value of a home, or MEDV. The attributes are defined as follows:

Attribute Description
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per 10,000 dollars
PTRATIO pupil-teacher ratio by town
B $1000(Bk - 0.63)^2$ where Bk is the proportion of blacks by town
LSTAT percentage lower status of the population
MEDV Median value of owner-occupied homes in $1000's

Analysis

In [1]:
#import packages
import numpy as np
import pandas as pd
from numpy.random import randn
np.set_printoptions(precision=5, suppress=True)
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston #dataset
import statsmodels.api as sm
import random
from prettytable import PrettyTable
In [2]:
# load dataset
bostondf = load_boston()
print(bostondf.data.shape)

# create data frame
boston = pd.DataFrame(bostondf.data, columns=bostondf.feature_names)

# create data frame for the target variable - median value of homes
target = pd.DataFrame(bostondf.target, columns=["MEDV"])
(506, 13)

After importing the necessary packages and loading the data, the next step is to take a quick look at it. Below are the descriptive statistics for each variable, which are useful when deciding whether to impute with the mean, the median, or some other value; a quick way to check this is sketched after the table.

In [3]:
#get variable info
boston.describe()
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000
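
One quick, hypothetical check for choosing between mean and median imputation (not a cell from the original notebook) is to compare the two alongside each column's skewness in the boston DataFrame loaded above; for heavily right-skewed variables such as CRIM and ZN, whose means sit well above their medians in the table, the median is usually the safer fill value.

#compare mean, median, and skewness for each feature;
#a large mean-median gap or high skew suggests imputing with the median
summary = pd.DataFrame({'mean': boston.mean(),
                        'median': boston.median(),
                        'skew': boston.skew()})
print(summary.sort_values('skew', ascending=False))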

The target variable, MEDV, is the median home value in $1000's.

In [4]:
#MEDV info
target.describe()
Out[4]:
MEDV
count 506.000000
mean 22.532806
std 9.197104
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
In [5]:
#plot target variable
plt.hist(bostondf['target'], bins=20)
Out[5]:
(array([ 9., 12., 18., 37., 40., 42., 83., 71., 72., 12., 23., 18., 16.,
        14.,  7.,  1.,  5.,  5.,  2., 19.]),
 array([ 5.  ,  7.25,  9.5 , 11.75, 14.  , 16.25, 18.5 , 20.75, 23.  ,
        25.25, 27.5 , 29.75, 32.  , 34.25, 36.5 , 38.75, 41.  , 43.25,
        45.5 , 47.75, 50.  ]),
 <a list of 20 Patch objects>)

Figure 1: Distribution of the target variable, the median value of homes (MEDV).

Baseline Model

Using Sklearn, get the Boston Housing dataset. Fit a linear regressor to the data as a baseline; there is no need to do cross-validation, since we are exploring the change in results. What is the loss and what are the goodness-of-fit parameters? This will be our baseline for comparison in future tests.

First, a baseline model will be created for comparison. Statsmodels' OLS() function is used to fit the linear regression model, from which the $R^2$ value and the Root Mean Square Error (RMSE) are obtained. $R^2$ is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model; an $R^2$ closer to 1 indicates a better fit. RMSE measures the typical difference between the values predicted by the model and the observed values; a lower RMSE is preferred.
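
In terms of the $n$ observed values $y_i$, the fitted values $\hat{y}_i$, and the mean $\bar{y}$ of the observed values, these metrics are

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$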

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#base model with MEDV as target
X = sm.add_constant(boston)
regOLS = sm.OLS(target,X).fit()

#linear regression prediction
sm_pred = regOLS.predict(X)
base_RMSE = np.sqrt(mean_squared_error(target, sm_pred))
base_R2 = regOLS.rsquared
print('Baseline R Square', base_R2)
print('Baseline RMSE:',base_RMSE)
Baseline R Square 0.7406077428649427
Baseline RMSE: 4.679506300635516
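
For a fuller view of the baseline fit than $R^2$ and RMSE alone, statsmodels can also print the complete coefficient table (this call is a suggestion and was not part of the original run):

#display coefficients, standard errors, and p-values for the baseline model
print(regOLS.summary())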

First Imputation Model - Step 2

Missing Completely at Random

Select between 1, 5, 10, 20, 33, and 50% of your data in a single column (completely at random), replace the present values with NaN, and then perform an imputation of those values.

For this model, LSTAT was chosen as the variable to be imputed. Functions are created to drop a given percentage (1%, 5%, 10%, etc.) of the LSTAT values at random and then replace them with the mean of the variable, which is 12.65. After imputation, the regression model is run again for comparison with the baseline.

In [7]:
#function for dropping values from full dataset
def naDrop(a, percent):
    # work on a copy so the original column is left untouched
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # positions to mask - values are dropped at random from the single column
    mask = random.sample(range(mat.size), prop)
    # replace the chosen positions with NaN
    mat.iloc[mask] = np.nan
    return mat
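
As a quick sanity check (a hypothetical cell, not part of the original notebook), applying naDrop to a small Series confirms that the requested fraction of values becomes NaN:

#mask 10% of a 100-value Series and count the resulting NaNs
demo = pd.Series(np.arange(100, dtype=float))
print(naDrop(demo, 0.10).isna().sum())   #prints 10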
In [8]:
#function for imputation and regression
def step2_impute(data_loss):
    boston1 = boston.copy()
    data1 = boston1['LSTAT']
    modified1 = naDrop(data1, data_loss)
    boston1['LSTAT'] = modified1

    # replace Nan with the mean of the variable
    boston1['LSTAT'].fillna(boston1['LSTAT'].mean(), inplace=True)
    
    #regression model with MEDV as target
    X = sm.add_constant(boston1) 
    regOLS = sm.OLS(target,X).fit() 
 
    #linear regression prediction
    sm_pred = regOLS.predict(X) 
    RMSE = np.sqrt(mean_squared_error(target, sm_pred))
    return (RMSE, regOLS.rsquared)
In [9]:
#function defines what percentage of variables to drop/impute & display results
def step2_fit_loss():
    losses = [0.01, 0.05, 0.10, 0.20, 0.33, 0.5]
    #Display in table
    table = PrettyTable(['Imputation', 'RMSE', 'R2'])
    for loss in losses:
        RMSE, r2_test = step2_impute(loss)
        percentage = loss * 100
        table.add_row(['{}%'.format(percentage), round(RMSE, 4), round(r2_test, 4)])
    print(table)
In [10]:
#run functions and print results
step2_fit_loss()
#print baseline model
print('Baseline R Square', base_R2)
print('Baseline RMSE:',base_RMSE)
+------------+--------+--------+
| Imputation |  RMSE  |   R2   |
+------------+--------+--------+
|    1.0%    | 4.688  | 0.7397 |
|    5.0%    | 4.6963 | 0.7387 |
|   10.0%    | 4.7962 | 0.7275 |
|   20.0%    | 4.7451 | 0.7333 |
|   33.0%    | 4.8744 | 0.7186 |
|   50.0%    | 5.1002 | 0.6919 |
+------------+--------+--------+
Baseline R Square 0.7406077428649427
Baseline RMSE: 4.679506300635516

As expected, the baseline model still has the best RMSE and $R^2$ scores. Though the differences in RMSE and $R^2$ among the imputed models are not large, the lowest RMSE belongs to the model with 1% of the LSTAT variable imputed with the mean. The error rises and the $R^2$ falls as the percentage of missing data increases, except that the 20% imputation model scores better than the 10% model; since no random seed is set, the sampled positions differ from run to run, which likely explains this small reversal. The model with the highest error and lowest $R^2$, unsurprisingly, is the one with 50% of the LSTAT variable imputed.

Second Imputation Model - Step 3

Missing at Random

Take 2 different columns and create data “Missing at Random” when controlled for a third variable (i.e., if Variable Z is > 30, then Variables X, Y are randomly missing). Make runs with 10%, 20% and 30% missing data imputed via your best guess. Repeat your fit and comparisons to the baseline.

For the second imputation model, where the weighted distance DIS is greater than 3, random instances of the NOX and ZN variables are dropped at a given percentage (10%, 20%, 30%). The missing NOX values are then replaced with the mean of the variable and the missing ZN values with the median.

In [11]:
#function for dropping values from a column (same logic as naDrop above)
def step3_drop(a, percent):
    # work on a copy of the column
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # positions to mask at random
    mask = random.sample(range(mat.size), prop)
    # replace the chosen positions with NaN
    mat.iloc[mask] = np.nan
    return mat
In [12]:
#function for imputation and regression
def step3_impute(data_loss):
    boston1 = boston.copy()
    #first variable
    data1 = boston1['NOX'].loc[boston1['DIS'] > 3 ]
    modified1 = step3_drop(data1, data_loss)
    boston1['NOX'] = modified1
    
    # impute Nan with the mean for first variable
    boston1['NOX'].fillna(boston1['NOX'].mean(), inplace=True)
    
    #second variable
    data1 = boston1['ZN'].loc[boston1['DIS'] > 3 ]
    modified1 = step3_drop(data1, data_loss)
    boston1['ZN'] = modified1
    # replace Nan with the median for second variable
    boston1['ZN'].fillna(boston1['ZN'].median(), inplace=True)
    
    #regression model with MEDV as target
    X = sm.add_constant(boston1) 
    regOLS = sm.OLS(target,X).fit()
     
 
    #linear regression prediction
    sm_pred = regOLS.predict(X) 
    RMSE = np.sqrt(mean_squared_error(target, sm_pred))
    return (RMSE, regOLS.rsquared)
In [13]:
#function defines what percentage of variables to drop/impute & display results
def step3_fit_loss():
    losses = [0.10, 0.20, 0.30]
    table = PrettyTable(['Imputation', 'RMSE', 'R2'])
    for loss in losses:
        RMSE, r2_test = step3_impute(loss)
        percentage = loss * 100
        table.add_row(['{}%'.format(percentage), round(RMSE, 4), round(r2_test, 4)])
    print(table)
In [16]:
#run functions and print results
step3_fit_loss()
#print baseline model
print('Baseline R Square', base_R2)
print('Baseline RMSE:',base_RMSE)
+------------+--------+--------+
| Imputation |  RMSE  |   R2   |
+------------+--------+--------+
|   10.0%    | 4.7829 | 0.729  |
|   20.0%    | 4.7901 | 0.7282 |
|   30.0%    | 4.8121 | 0.7257 |
+------------+--------+--------+
Baseline R Square 0.7406077428649427
Baseline RMSE: 4.679506300635516

These MAR imputation models behave as expected, with the RMSE and $R^2$ scores becoming worse as the percentage of imputed data rises. The lowest RMSE and highest $R^2$ belong to the model with 10% of the NOX and ZN values imputed with the mean or median. This 10% model is closest to the baseline model, but the baseline is still better.

Third Imputation Model - Step 4

Missing Not at Random

Create a Missing Not at Random pattern in which 25% of the data is missing for a single column. Impute your data, fit the results and compare to a baseline.

For this imputation model, the AGE variable was chosen. To make roughly a quarter of the AGE values missing not at random, the first quartile of the data (values of 45 or less) was replaced with NaN. These instances were then imputed with the mean of the entire variable. The mean AGE is 68.57, which is a considerable jump from values that were at most 45.

In [17]:
#create copy and drop ages 45 and under
boston1 = boston.copy()
boston1.loc[boston1.AGE <= 45,['AGE']] = np.nan
In [18]:
#show null value count
print(boston1.isnull().sum())
CRIM         0
ZN           0
INDUS        0
CHAS         0
NOX          0
RM           0
AGE        127
DIS          0
RAD          0
TAX          0
PTRATIO      0
B            0
LSTAT        0
dtype: int64
In [19]:
    
#impute Nan with the mean for age variable
boston1['AGE'].fillna(boston1['AGE'].mean(), inplace=True)

#regression model with MEDV as target
X = sm.add_constant(boston1) 
regOLS = sm.OLS(target,X).fit()

#linear regression prediction 
sm_pred = regOLS.predict(X) 
RMSE = np.sqrt(mean_squared_error(target, sm_pred))

#print baseline
print('Baseline R Square', base_R2)
print('Baseline RMSE:',base_RMSE)
#print results
print('---------------------')
print('R Square', regOLS.rsquared)
print('RMSE:',RMSE)
Baseline R Square 0.7406077428649427
Baseline RMSE: 4.679506300635516
---------------------
R Square 0.7421573514330941
RMSE: 4.665507684073446

The results of this MNAR model's linear regression are close to the baseline. Surprisingly, the $R^2$ and RMSE actually indicate a slightly better fit than the baseline. For this dataset, substituting AGE values of 45 or less with the mean of 68.57 did not greatly affect the outcome of the multiple regression, likely because AGE contributes relatively little to the fit once the other predictors are included.

Conclusion

Imputation is a necessity for most real-world datasets, but it should be thought through carefully, since how missing values are handled can change the results and the conclusions drawn. In this example, the higher the percentage of data that was imputed, the "worse" the regression model generally performed, though for most runs the RMSE and $R^2$ did not deviate greatly from the baseline. Other variables might have been affected by imputation more strongly and impacted the regression results to a larger extent. In the future, it would be interesting to try not only other variables but also other imputation techniques to see how they affect the analysis; one such option is sketched below.
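
As one possible extension (a sketch under stated assumptions, not part of the original analysis), scikit-learn's KNNImputer could replace the mean and median fills used above; it estimates each missing value from the most similar complete rows. Here boston_missing is a hypothetical name for any of the modified DataFrames created above before their fillna step.

from sklearn.impute import KNNImputer

#boston_missing is assumed to be a copy of the data that still contains NaNs
#(i.e., one of the modified frames created above, before its fillna step)
imputer = KNNImputer(n_neighbors=5)
boston_knn = pd.DataFrame(imputer.fit_transform(boston_missing),
                          columns=boston_missing.columns)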