When working with real-world datasets, the data often requires a good deal of cleaning and preparation before it is useful. This is especially true when portions of the data are missing. An analyst should examine missing data to try to understand why it is missing, and then carefully decide how to handle it: should records with missing values be removed, or should the missing values be imputed? If imputed, what method should be used? How these issues are handled can greatly affect the final analysis.
For this case study, three different patterns of missing data are created using the Boston Housing Dataset from sklearn: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). A baseline regression model is first fit on the full dataset so that the effects of imputing the missing data can be compared against it. Comparisons are made using Root Mean Square Error (RMSE) and $R^2$. One might expect that the more data is imputed, the more error is introduced into the regression and the worse the fit becomes.
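To make the distinction between the three mechanisms concrete, the sketch below shows one way each pattern could be generated on a hypothetical DataFrame `df` (the column names, thresholds, and fractions here are illustrative placeholders, not the case study's own parameters, which are defined later).
# Minimal sketch of the three missingness mechanisms on a hypothetical DataFrame df
# (illustrative only; the case study's own masking functions appear below)
import numpy as np

rng = np.random.default_rng(0)

def make_mcar(df, col, frac=0.10):
    # MCAR: every row is equally likely to lose its value
    out = df.copy()
    idx = out.sample(frac=frac, random_state=0).index
    out.loc[idx, col] = np.nan
    return out

def make_mar(df, col, cond_col, threshold, frac=0.10):
    # MAR: missingness depends on another, fully observed column
    out = df.copy()
    eligible = out.index[out[cond_col] > threshold]
    idx = rng.choice(eligible, size=int(frac * len(eligible)), replace=False)
    out.loc[idx, col] = np.nan
    return out

def make_mnar(df, col, threshold):
    # MNAR: missingness depends on the value that goes missing
    out = df.copy()
    out.loc[out[col] <= threshold, col] = np.nan
    return out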
The Boston Housing Dataset is available from sklearn. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Each record in the dataset describes a Boston suburb or town. There are 506 records and 14 attributes. The target variable is the median home value, MEDV. The attributes are defined as follows:
| Attribute | Description |
|---|---|
| CRIM | per capita crime rate by town |
| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. |
| INDUS | proportion of non-retail business acres per town |
| CHAS | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| NOX | nitric oxides concentration (parts per 10 million) |
| RM | average number of rooms per dwelling |
| AGE | proportion of owner-occupied units built prior to 1940 |
| DIS | weighted distances to five Boston employment centres |
| RAD | index of accessibility to radial highways |
| TAX | full-value property-tax rate per 10,000 dollars |
| PTRATIO | pupil-teacher ratio by town |
| B | $1000(Bk - 0.63)^2$ where Bk is the proportion of blacks by town |
| LSTAT | percentage lower status of the population |
| MEDV | Median value of owner-occupied homes in $1000's |
#import packages
import numpy as np
import pandas as pd
from numpy.random import randn
np.set_printoptions(precision=5, suppress=True)
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # dataset
import statsmodels.api as sm
import random
from prettytable import PrettyTable
# load dataset
bostondf = load_boston()
print(bostondf.data.shape)
# create data frame
boston = pd.DataFrame(bostondf.data, columns=bostondf.feature_names)
# create data frame for the target variable - median value of homes
target = pd.DataFrame(bostondf.target, columns=["MEDV"])
#get variable info
boston.describe()
#MEDV info
target.describe()
#plot target variable (MEDV)
plt.hist(target['MEDV'], bins=20)
plt.xlabel('MEDV ($1000s)')
plt.show()
First, a baseline model is created for comparison. Statsmodels' OLS() function is used to get the $R^2$ value and the Root Mean Square Error (RMSE) for the linear regression model. $R^2$ is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model; an $R^2$ closer to 1 indicates a better fit. RMSE measures the differences between the values predicted by the model and the observed values; a lower RMSE is better.
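For reference, with $y_i$ the observed MEDV values, $\hat{y}_i$ the model's predictions, $\bar{y}$ the mean of the observed values, and $n$ the number of records, the two metrics are

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}.$$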
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
#base model with MEDV as target
X = sm.add_constant(boston)
regOLS = sm.OLS(target, X).fit()
#linear regression prediction
sm_pred = regOLS.predict(X)
base_rsquared = regOLS.rsquared
base_RMSE = np.sqrt(mean_squared_error(target, sm_pred))
print('Baseline R Square:', base_rsquared)
print('Baseline RMSE:', base_RMSE)
For this model, LSTAT was chosen as the variable to be imputed. Functions are created to drop a given percentage (1%, 5%, 10%, etc.) of the LSTAT values at random and then replace them with the variable's mean, which is 12.65. After imputation, the regression model is run again and compared with the baseline.
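The mean can be confirmed directly from the full data frame before any values are dropped:
#check the mean of LSTAT in the full dataset
print(round(boston['LSTAT'].mean(), 2))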
#function for dropping a random fraction of values from a column (MCAR)
def naDrop(a, percent):
    # work on a copy of the column
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # positional indices to mask - values are dropped at random from a single column
    mask = random.sample(range(mat.size), prop)
    # replace the selected entries with NaN
    mat.iloc[mask] = np.nan
    return mat
#function for mean imputation of LSTAT and regression
def step2_impute(data_loss):
    boston1 = boston.copy()
    # drop a random fraction of LSTAT values
    modified1 = naDrop(boston1['LSTAT'], data_loss)
    boston1['LSTAT'] = modified1
    # replace NaN with the mean of the variable
    boston1['LSTAT'] = boston1['LSTAT'].fillna(boston1['LSTAT'].mean())
    #regression model with MEDV as target
    X = sm.add_constant(boston1)
    regOLS = sm.OLS(target, X).fit()
    #linear regression prediction
    sm_pred = regOLS.predict(X)
    RMSE = np.sqrt(mean_squared_error(target, sm_pred))
    return RMSE, regOLS.rsquared
#function defines what percentages of LSTAT to drop/impute & displays results
def step2_fit_loss():
    losses = [0.01, 0.05, 0.10, 0.20, 0.33, 0.5]
    #display results in a table
    table = PrettyTable(['Imputation', 'RMSE', 'R2'])
    for loss in losses:
        RMSE, r2_test = step2_impute(loss)
        percentage = loss * 100
        table.add_row(['{}%'.format(percentage), round(RMSE, 4), round(r2_test, 4)])
    print(table)
#run functions and print results
step2_fit_loss()
#print baseline model
print('Baseline R Square:', base_rsquared)
print('Baseline RMSE:', base_RMSE)
With the LSTAT variable imputed with the mean, the error rises and the $R^2$ falls as the percentage of missing data increases, except for the 20% imputation model, which performs slightly better than the 10% model. Unsurprisingly, the model with the highest error and the lowest $R^2$ is the one with 50% of the LSTAT values imputed.

For the second imputation model, the NOX and ZN variables lose a given percentage (10%, 20%, 30%) of their values at random, but only in records where the DIS variable exceeds 3. The missing NOX values are then replaced with the variable's mean and the missing ZN values with its median.
#function for dropping a random fraction of values from a column subset
def step3_drop(a, percent):
    mat = a.copy()
    # number of values to replace within the subset
    prop = int(mat.size * percent)
    # positional indices to mask
    mask = random.sample(range(mat.size), prop)
    # replace the selected entries with NaN
    mat.iloc[mask] = np.nan
    return mat
#function for imputation and regression
def step3_impute(data_loss):
    boston1 = boston.copy()
    subset = boston1['DIS'] > 3
    #first variable: drop NOX values only where DIS > 3
    modified1 = step3_drop(boston1.loc[subset, 'NOX'], data_loss)
    boston1.loc[subset, 'NOX'] = modified1
    # impute NaN with the mean for the first variable
    boston1['NOX'] = boston1['NOX'].fillna(boston1['NOX'].mean())
    #second variable: drop ZN values only where DIS > 3
    modified2 = step3_drop(boston1.loc[subset, 'ZN'], data_loss)
    boston1.loc[subset, 'ZN'] = modified2
    # impute NaN with the median for the second variable
    boston1['ZN'] = boston1['ZN'].fillna(boston1['ZN'].median())
    #regression model with MEDV as target
    X = sm.add_constant(boston1)
    regOLS = sm.OLS(target, X).fit()
    #linear regression prediction
    sm_pred = regOLS.predict(X)
    RMSE = np.sqrt(mean_squared_error(target, sm_pred))
    return RMSE, regOLS.rsquared
#function defines what percentages of the variables to drop/impute & displays results
def step3_fit_loss():
    losses = [0.10, 0.20, 0.30]
    table = PrettyTable(['Imputation', 'RMSE', 'R2'])
    for loss in losses:
        RMSE, r2_test = step3_impute(loss)
        percentage = loss * 100
        table.add_row(['{}%'.format(percentage), round(RMSE, 4), round(r2_test, 4)])
    print(table)
#run functions and print results
step3_fit_loss()
#print baseline model
print('Baseline R Square:', base_rsquared)
print('Baseline RMSE:', base_RMSE)
With the NOX and ZN variables imputed with the mean and median, respectively, the 10% model comes closest to the baseline model, but the baseline is still better.

For the third imputation model, the AGE variable was chosen. To make roughly a quarter of the AGE data missing not at random, the first quartile of the data (values of 45 and below) was replaced with NaN. These instances were then imputed with the mean of the entire variable. The mean AGE is 68.57, which is quite different from 45.
#create copy and drop ages 45 and under
boston1 = boston.copy()
boston1.loc[boston1.AGE <= 45,['AGE']] = np.nan
#show null value count
print(boston1.isnull().sum())
#impute NaN with the mean for the AGE variable
boston1['AGE'] = boston1['AGE'].fillna(boston1['AGE'].mean())
#regression model with MEDV as target
X = sm.add_constant(boston1)
regOLS = sm.OLS(target,X).fit()
#linear regression prediction
sm_pred = regOLS.predict(X)
RMSE = np.sqrt(mean_squared_error(target, sm_pred))
#print baseline model for comparison
print('Baseline R Square:', base_rsquared)
print('Baseline RMSE:', base_RMSE)
#print results for the MNAR model
print('---------------------')
print('R Square:', regOLS.rsquared)
print('RMSE:', RMSE)
Imputation is a necessity for most real-world datasets, but it should be thought through carefully. How NaN values are handled can change the results and the conclusions drawn from them. In this example, the higher the percentage of data that was imputed, the "worse" the regression model generally performed, although for most results the RMSE and $R^2$ did not deviate greatly from the baseline. Different variables might have been affected more strongly by imputation and might have changed the regression results to a larger extent. In the future, it would be interesting to try not only other variables but also other imputation techniques to see how they affect the analysis.
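As one example of such an alternative (not used in the analysis above), scikit-learn's KNNImputer could stand in for the mean and median fills; a minimal sketch, assuming `boston1` is a copy of the Boston data frame in which some values have already been set to NaN:
#sketch of an alternative, KNN-based imputation (not part of the analysis above)
#assumes boston1 is a copy of the Boston data frame with NaN values already introduced
import pandas as pd
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)  # fill each NaN from the 5 most similar rows
boston_knn = pd.DataFrame(imputer.fit_transform(boston1), columns=boston1.columns)
#boston_knn can then be run through the same OLS/RMSE comparison used above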