Quantifying the World

Case Study - Data Imputation - Boston Housing Data

Stacy Conant

March 16, 2020

Introduction

When working with real-world datasets, it is often the case that the data will require a good deal of cleaning and preparation to be useful. This is especially true when portions of the data are missing. An analyst should examine missing data to try to understand why it is missing, and then carefully decide how to handle the missing values: should the records that have missing data be removed, or should the values be imputed? If imputed, what method should be used? How these issues are handled can greatly affect the final data analysis.

For this case study, three different patterns of missing data are created using the Boston Housing dataset from sklearn: a missing completely at random (MCAR) model, a missing at random (MAR) model, and a missing not at random (MNAR) model. A baseline regression model was fit on the full dataset so that the effects of imputing the missing data can be compared against it. Comparisons are made using Root Mean Square Error (RMSE) and $R^2$. One might expect that the more data is imputed, the more error is introduced into the regression and the more the fit is adversely affected.
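
As a brief illustration (a toy sketch, not part of the original notebook), the three mechanisms differ in what drives the missingness: under MCAR the chance that a value is missing is unrelated to any data; under MAR it depends only on other observed variables; under MNAR it depends on the value that goes missing itself. With hypothetical columns x and y:

import numpy as np
import pandas as pd

np.random.seed(0)
toy = pd.DataFrame({'x': np.random.normal(50, 10, 1000),
                    'y': np.random.normal(0, 1, 1000)})

# MCAR: whether y is missing is unrelated to any variable, observed or not
mcar = toy.copy()
mcar.loc[np.random.rand(len(toy)) < 0.2, 'y'] = np.nan

# MAR: whether y is missing depends only on the observed variable x
mar = toy.copy()
mar.loc[(toy['x'] > 60) & (np.random.rand(len(toy)) < 0.5), 'y'] = np.nan

# MNAR: whether y is missing depends on the value of y itself
mnar = toy.copy()
mnar.loc[toy['y'] > 1.0, 'y'] = np.nan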

Data

The Boston Housing dataset is available from Sklearn. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Each record in the database describes a Boston suburb or town. There are 506 records and 14 attributes. The target variable is the median value of a home, or MEDV. The attributes are defined as follows:

Attribute Description
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per 10,000 dollars
PTRATIO pupil-teacher ratio by town
B $1000(Bk - 0.63)^2$ where Bk is the proportion of blacks by town
LSTAT percentage lower status of the population
MEDV Median value of owner-occupied homes in $1000's

Analysis

In [1]:
#import packages
import numpy as np
import pandas as pd
from numpy.random import randn
np.set_printoptions(precision=5, suppress=True)
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston #dataset
import statsmodels.api as sm
import random
from prettytable import PrettyTable
In [2]:
# load dataset
bostondf = load_boston()
print(bostondf.data.shape)

# create data frame
boston = pd.DataFrame(bostondf.data, columns=bostondf.feature_names)

# create data frame for the target variable - median value of homes
target = pd.DataFrame(bostondf.target, columns=["MEDV"])
(506, 13)

After importing the necessary packages and loading the data, the next step is to take a quick look at it. Below are the descriptive statistics for each variable, which are useful when deciding whether to impute with the mean, the median, or some other value; a quick way to check this is sketched after the table.

In [3]:
#get variable info
boston.describe()
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000
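
One quick, hypothetical check for choosing between mean and median imputation (not a cell from the original notebook) is to compare the two alongside each column's skewness in the boston DataFrame loaded above; for heavily right-skewed variables such as CRIM and ZN, whose means sit well above their medians in the table, the median is usually the safer fill value.

#compare mean, median, and skewness for each feature;
#a large mean-median gap or high skew suggests imputing with the median
summary = pd.DataFrame({'mean': boston.mean(),
                        'median': boston.median(),
                        'skew': boston.skew()})
print(summary.sort_values('skew', ascending=False))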

The target variable, MEDV, is the median home value in $1000's.

In [4]:
#MEDV info
target.describe()
Out[4]:
MEDV
count 506.000000
mean 22.532806
std 9.197104
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
In [5]:
#plot target variable
plt.hist(bostondf['target'], bins=20)
Out[5]:
(array([ 9., 12., 18., 37., 40., 42., 83., 71., 72., 12., 23., 18., 16.,
        14.,  7.,  1.,  5.,  5.,  2., 19.]),
 array([ 5.  ,  7.25,  9.5 , 11.75, 14.  , 16.25, 18.5 , 20.75, 23.  ,
        25.25, 27.5 , 29.75, 32.  , 34.25, 36.5 , 38.75, 41.  , 43.25,
        45.5 , 47.75, 50.  ]),
 <a list of 20 Patch objects>)

Figure 1: Distribution of the target variable, the median value of homes (MEDV).

Baseline Model

Using Sklearn, get the Boston Housing dataset. Fit a linear regressor to the data as a baseline; there is no need to do cross-validation, since we are exploring the change in results. What is the loss and what are the goodness-of-fit parameters? This will be our baseline for comparison in future tests.

First, a baseline model will be created for comparison. Statsmodels' OLS() function is used to fit the linear regression model, from which the $R^2$ value and the Root Mean Square Error (RMSE) are obtained. $R^2$ is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model; an $R^2$ closer to 1 indicates a better fit. RMSE measures the typical difference between the values predicted by the model and the observed values; a lower RMSE is preferred.
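
In terms of the $n$ observed values $y_i$, the fitted values $\hat{y}_i$, and the mean $\bar{y}$ of the observed values, these metrics are

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$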

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#base model with MEDV as target
X = sm.add_constant(boston)
regOLS = sm.OLS(target,X).fit()

#linear regression prediction
sm_pred = regOLS.predict(X)
base_RMSE = np.sqrt(mean_squared_error(target, sm_pred))
base_R2 = regOLS.rsquared
print('Baseline R Square', base_R2)
print('Baseline RMSE:',base_RMSE)
Baseline R Square 0.7406077428649427
Baseline RMSE: 4.679506300635516
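
For a fuller view of the baseline fit than $R^2$ and RMSE alone, statsmodels can also print the complete coefficient table (this call is a suggestion and was not part of the original run):

#display coefficients, standard errors, and p-values for the baseline model
print(regOLS.summary())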

First Imputation Model - Step 2

Missing Completely at Random

Select between 1, 5, 10, 20, 33, and 50% of your data in a single column (completely at random), replace the present values with NaN, and then perform an imputation of those values.

For this model, LSTAT was chosen as the variable to be imputed. Functions are created to drop a given percentage (1%, 5%, 10%, etc.) of the LSTAT values at random and then replace them with the mean of the variable, which is 12.65. After imputation, the regression model is run again for comparison with the baseline.

In [7]:
#function for dropping values from full dataset
def naDrop(a, percent):
    # work on a copy so the original column is left untouched
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # positions to mask - values are dropped at random from the single column
    mask = random.sample(range(mat.size), prop)
    # replace the chosen positions with NaN
    mat.iloc[mask] = np.nan
    return mat
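
As a quick sanity check (a hypothetical cell, not part of the original notebook), applying naDrop to a small Series confirms that the requested fraction of values becomes NaN:

#mask 10% of a 100-value Series and count the resulting NaNs
demo = pd.Series(np.arange(100, dtype=float))
print(naDrop(demo, 0.10).isna().sum())   #prints 10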
In [8]:
#function for imputation and regression
def step2_impute(data_loss):
    boston1 = boston.copy()
    data1 = boston1['LSTAT']
    modified1 = naDrop(data1, data_loss)
    boston1['LSTAT'] = modified1

    # replace Nan with the mean of the variable
    boston1['LSTAT'].fillna(boston1['LSTAT'].mean(), inplace=True)
    
    #regression model with MEDV as target
    X = sm.add_constant(boston1) 
    regOLS = sm.OLS(target,X).fit() 
 
    #linear regression prediction
    sm_pred = regOLS.predict(X) 
    RMSE = np.sqrt(mean_squared_error(target, sm_pred))
    return (RMSE, regOLS.rsquared)
In [9]:
#function defines what percentage of variables to drop/impute & display results
def step2_fit_loss():
    losses = [0.01, 0.05, 0.10, 0.20, 0.33, 0.5]
    #Display in table
    table = PrettyTable(['Imputation', 'RMSE', 'R2'])
    for loss in losses:
        RMSE, r2_test = step2_impute(loss)
        percentage = loss * 100
        table.add_row(['{}%'.format(percentage), round(RMSE, 4), round(r2_test, 4)])
    print(table)
In [10]:
#run functions and print results
step2_fit_loss()
#print baseline model
print('Baseline R Square', base_R2)
print('Baseline RMSE:',base_RMSE)
+------------+--------+--------+
| Imputation |  RMSE  |   R2   |
+------------+--------+--------+
|    1.0%    | 4.688  | 0.7397 |
|    5.0%    | 4.6963 | 0.7387 |
|   10.0%    | 4.7962 | 0.7275 |
|   20.0%    | 4.7451 | 0.7333 |
|   33.0%    | 4.8744 | 0.7186 |
|   50.0%    | 5.1002 | 0.6919 |
+------------+--------+--------+
Baseline R Square 0.7406077428649427
Baseline RMSE: 4.679506300635516

As expected, the baseline model still has the best RMSE and $R^2$ scores. Though the differences in RMSE and $R^2$ among the imputed models are not large, the lowest RMSE belongs to the model with 1% of the LSTAT variable imputed with the mean. The error rises and the $R^2$ falls as the percentage of missing data increases, except that the 20% imputation model scores better than the 10% model; since no random seed is set, the sampled positions differ from run to run, which likely explains this small reversal. The model with the highest error and lowest $R^2$, unsurprisingly, is the one with 50% of the LSTAT variable imputed.

Second Imputation Model - Step 3

Missing at Random

Take 2 different columns and create data “Missing at Random” when controlled for a third variable (i.e., if Variable Z is > 30, then Variables X, Y are randomly missing). Make runs with 10%, 20% and 30% missing data imputed via your best guess. Repeat your fit and comparisons to the baseline.

For the second imputation model, where the weighted distance DIS is greater than 3, random instances of the NOX and ZN variables are dropped at a given percentage (10%, 20%, 30%). The missing NOX values are then replaced with the mean of the variable and the missing ZN values with the median.

In [11]:
#function for dropping values from a column (same logic as naDrop above)
def step3_drop(a, percent):
    # work on a copy of the column
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # positions to mask at random
    mask = random.sample(range(mat.size), prop)
    # replace the chosen positions with NaN
    mat.iloc[mask] = np.nan
    return mat
In [12]:
#function for imputation and regression
def step3_impute(data_loss):
    boston1 = boston.copy()
    #first variable
    data1 = boston1['NOX'].loc[boston1['DIS'] > 3 ]
    modified1 = step3_drop(data1, data_loss)
    boston1['NOX'] = modified1
    
    # impute Nan with the mean for first variable
    boston1['NOX'].fillna(boston1['NOX'].mean(), inplace=True)
    
    #second variable
    data1 = boston1['ZN'].loc[boston1['DIS'] > 3 ]
    modified1 = step3_drop(data1, data_loss)
    boston1['ZN'] = modified1
    # replace Nan with the median for second variable
    boston1['ZN'].fillna(boston1['ZN'].median(), inplace=True)
    
    #regression model with MEDV as target
    X = sm.add_constant(boston1) 
    regOLS = sm.OLS(target,X).fit()
     
 
    #linear regression prediction
    sm_pred = regOLS.predict(X) 
    RMSE = np.sqrt(mean_squared_error(target, sm_pred))
    return (RMSE, regOLS.rsquared)
In [13]:
#function defines what percentage of variables to drop/impute & display results
def step3_fit_loss():
    losses = [0.10, 0.20, 0.30]
    table = PrettyTable(['Imputation', 'RMSE', 'R2'])
    for loss in losses:
        RMSE, r2_test = step3_impute(loss)
        percentage = loss * 100
        table.add_row(['{}%'.format(percentage), round(RMSE, 4), round(r2_test, 4)])
    print(table)
In [16]:
#run functions and print results
step3_fit_loss()
#print baseline model
print('Baseline R Square', base_R2)
print('Baseline RMSE:',base_RMSE)
+------------+--------+--------+
| Imputation |  RMSE  |   R2   |
+------------+--------+--------+
|   10.0%    | 4.7829 | 0.729  |
|   20.0%    | 4.7901 | 0.7282 |
|   30.0%    | 4.8121 | 0.7257 |
+------------+--------+--------+
Baseline R Square 0.7406077428649427
Baseline RMSE: 4.679506300635516

These MAR imputation models behave as expected, with the RMSE and $R^2$ scores becoming worse as the percentage of imputed data rises. The lowest RMSE and highest $R^2$ belong to the model with 10% of the NOX and ZN values imputed with the mean or median. This 10% model is closest to the baseline model, but the baseline is still better.

Third Imputation Model - Step 4

Missing Not at Random

Create a Missing Not at Random pattern in which 25% of the data is missing for a single column. Impute your data, fit the results and compare to a baseline.

For this imputation model, the AGE variable was chosen. To make roughly a quarter of the AGE values missing not at random, the first quartile of the data (values of 45 or less) was replaced with NaN. These instances were then imputed with the mean of the entire variable. The mean AGE is 68.57, which is a considerable jump from values that were at most 45.

In [17]:
#create copy and drop ages 45 and under
boston1 = boston.copy()
boston1.loc[boston1.AGE <= 45,['AGE']] = np.nan
In [18]:
#show null value count
print(boston1.isnull().sum())
CRIM         0
ZN           0
INDUS        0
CHAS         0
NOX          0
RM           0
AGE        127
DIS          0
RAD          0
TAX          0
PTRATIO      0
B            0
LSTAT        0
dtype: int64
In [19]:
    
#impute Nan with the mean for age variable
boston1['AGE'].fillna(boston1['AGE'].mean(), inplace=True)

#regression model with MEDV as target
X = sm.add_constant(boston1) 
regOLS = sm.OLS(target,X).fit()

#linear regression prediction 
sm_pred = regOLS.predict(X) 
RMSE = np.sqrt(mean_squared_error(target, sm_pred))

#print baseline
print('Baseline R Square', base_R2)
print('Baseline RMSE:',base_RMSE)
#print results
print('---------------------')
print('R Square', regOLS.rsquared)
print('RMSE:',RMSE)
Baseline R Square 0.7406077428649427
Baseline RMSE: 4.679506300635516
---------------------
R Square 0.7421573514330941
RMSE: 4.665507684073446

The results of this MNAR model's linear regression are close to the baseline. Surprisingly, the $R^2$ and RMSE actually indicate a slightly better fit than the baseline. For this dataset, substituting AGE values of 45 or less with the mean of 68.57 did not greatly affect the outcome of the multiple regression, likely because AGE contributes relatively little to the fit once the other predictors are included.

Conclusion

Imputation is a necessity for most real-world datasets, but it should be thought through carefully, since how missing values are handled can change the results and the conclusions drawn. In this example, the higher the percentage of data that was imputed, the "worse" the regression model generally performed, though for most runs the RMSE and $R^2$ did not deviate greatly from the baseline. Other variables might have been affected by imputation more strongly and impacted the regression results to a larger extent. In the future, it would be interesting to try not only other variables but also other imputation techniques to see how they affect the analysis; one such option is sketched below.
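
As one possible extension (a sketch under stated assumptions, not part of the original analysis), scikit-learn's KNNImputer could replace the mean and median fills used above; it estimates each missing value from the most similar complete rows. Here boston_missing is a hypothetical name for any of the modified DataFrames created above before their fillna step.

from sklearn.impute import KNNImputer

#boston_missing is assumed to be a copy of the data that still contains NaNs
#(i.e., one of the modified frames created above, before its fillna step)
imputer = KNNImputer(n_neighbors=5)
boston_knn = pd.DataFrame(imputer.fit_transform(boston_missing),
                          columns=boston_missing.columns)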