Quantifying the World

Case Study 6 - Spam

Question 19 - Recursive Partitioning and Classification Trees

Stacy Conant

Feb 17, 2020


Introduction

It is hard to imagine a productive work and social life without email. Checking an email inbox is an essential part of many people's daily routine. But this wonderful advancement in technology and productivity is not without its drawbacks, namely spam, the name given to the unwanted, obtrusive, and sometimes dangerous advertising emails that can overrun an inbox. Though it is usually quite easy to spot a spam email without opening it, by its subject or sender name, most email services have an automated procedure to identify spam and sort it into a spam folder.

This study will examine a dataset of over 9,000 email messages and attempt to distinguish an authentic email, or ham, from an inauthentic email, or spam. To accomplish this, Recursive Partitioning and Classification Trees will be explored using the R package rpart. Recursive partitioning is a set of nonparametric techniques for classification and prediction. It is a useful method whose results are easily visualized as the "trees" it produces, which display the predicted class or predicted value. The exploration will begin with the methods discussed in Nolan and Lang's Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. Specifically, this study will explore the following questions:

  • Consider the other parameters that can be used to control the recursive partitioning process.
  • Read the documentation for them in the rpart.control() documentation. Carry out an Internet search for more information on how to tweak the rpart() tuning parameters.
  • Experiment with values for these parameters. Do the trees that result make sense with your understanding of how the parameters are used? Can you improve the prediction using them?
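Before turning to the spam data, recursive partitioning itself can be illustrated with a minimal sketch. This example uses the small kyphosis data frame that ships with the rpart package, purely for orientation; it is not part of the study's analysis.

```r
# Minimal sketch of recursive partitioning with rpart, using the
# kyphosis data frame bundled with the package (illustrative only).
library(rpart)

# Fit a classification tree predicting the Kyphosis factor from
# three numeric predictors.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

print(fit)                    # text representation of the fitted tree
preds <- predict(fit, type = "class")  # predicted class per training row
table(preds, kyphosis$Kyphosis)        # simple confusion table
```

The same pattern — fit with `rpart()`, then `predict()` with `type = "class"` — is what the study applies to the spam data below.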


Data

The data comes from the open-source Apache SpamAssassin Project. The dataset contains 9348 email messages that have already been extensively cleaned and prepared for analysis using Nolan and Lang's method. Below is the description of the variables in the dataset.

Variable Description
isRe TRUE if Re: appears at the start of the subject.
numLines Number of lines in the body of the message.
bodyCharCt Number of characters in the body of the message.
underscore TRUE if email address in the From field of the header contains an underscore.
subExcCt Number of exclamation marks in the subject.
subQuesCt Number of question marks in the subject.
numAtt Number of attachments in the message.
priority TRUE if a Priority key is present in the header.
numRec Number of recipients of the message, including CCs.
perCaps Percentage of capitals among all letters in the message body, excluding attachments.
isInReplyTo TRUE if the In-Reply-To key is present in the header.
sortedRec TRUE if the recipients’ email addresses are sorted.
subPunc TRUE if words in the subject have punctuation or numbers embedded in them, e.g., w!se.
hour Hour of the day in the Date field.
multipartText TRUE if the MIME type is multipart/text.
hasImages TRUE if the message contains images.
isPGPsigned TRUE if the message contains a PGP signature.
perHTML Percentage of characters in HTML tags in the message body in comparison to all characters.
subSpamWords TRUE if the subject contains one of the words in a spam word vector.
subBlanks Percentage of blanks in the subject.
noHost TRUE if there is no hostname in the Message-Id key in the header.
numEnd TRUE if the email sender’s address (before the @) ends in a number.
isYelling TRUE if the subject is all capital letters.
forwards Number of forward symbols in a line of the body, e.g., >>> xxx contains 3 forwards.
isOrigMsg TRUE if the message body contains the phrase original message.
isDear TRUE if the message body contains the word dear.
isWrote TRUE if the message contains the phrase wrote:.
avgWordLen The average length of the words in a message.
numDlr Number of dollar signs in the message body.
In [1]:
#load libraries and data
library(rpart.plot)
library(rattle)
library("IRdisplay")
load("spamData.rda")
Warning message:
"package 'rpart.plot' was built under R version 3.6.2"
Loading required package: rpart
Warning message:
"package 'rattle' was built under R version 3.6.2"
Rattle: A free graphical interface for data science with R.
Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
In [2]:
#check data
head(emailDFrp)
                            isSpam isRe underscore priority isInReplyTo sortedRec subPunc multipartText hasImages isPGPsigned ... subQuesCt numAtt numRec  perCaps hour perHTML subBlanks  forwards avgWordLen numDlr
./Unit6//messages/easy_ham1      F    T          F        F           T         T       F             F         F           F ...         0      0      2 4.451039   11       0  12.50000  0.000000   4.376623      3
./Unit6//messages/easy_ham2      F    F          F        F           F         T       F             F         F           F ...         0      0      1 7.491289   11       0   8.00000  0.000000   4.555556      0
./Unit6//messages/easy_ham3      F    F          F        F           F         T       F             F         F           F ...         0      0      1 7.436096   12       0   8.00000  0.000000   4.817164      0
./Unit6//messages/easy_ham4      F    F          F        F           F         T       F             F         F           F ...         0      0      0 5.090909   13       0  18.91892  3.125000   4.714286      0
./Unit6//messages/easy_ham5      F    T          F        F           F         T       F             F         F           F ...         0      0      1 6.116643   13       0  15.21739  6.451613   4.234940      0
./Unit6//messages/easy_ham6      F    T          F        F           T         T       F             F         F           F ...         0      0      1 7.625272   13       0  15.21739 12.000000   3.956897      0
In [3]:
#checking summary stats to get spam message numbers
summary(emailDFrp$isSpam)
   F    T 
6951 2397 
In [4]:
#total number of emails, spams and non-spams(ham)
numEmail = 9348
numSpam = 2397
numHam = 6951


Method

After loading the data and R packages, the task is to run a baseline classification tree using the rpart package. From there, alternate settings for rpart.control() will be systematically explored to assess their effect on the model.

After running a recursive partitioning model, the predict() function will be used to assess how well the classifier does at predicting whether the test messages are legitimate (ham) or spam. The rpart fitted predictions are then compared to the actual message classifications, and the various models are evaluated by their Type I and Type II error. It is worth reviewing these terms.

Type I error

Type I error is the rejection of a true null hypothesis as the result of a test procedure. In this case, it would be classifying a message as spam, when it is actually a legitimate email (ham), or losing a real email to the spam folder. A type I error is equivalent to a false positive.

Type II error

Type II error is the failure to reject a false null hypothesis as the result of a test procedure. In this case, it would be classifying a spam message as a legitimate email (ham), or sending spam to your inbox. A type II error is equivalent to a false negative.

It is more important to reduce Type I error because an email user would prefer a lower risk of legitimate mail being classified as spam and potentially sent to spam folder or deleted.
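As a small sketch of how these two error rates are computed from vectors of actual and predicted labels (the vector names and toy values here are illustrative, not taken from the study's code; "T" denotes spam and "F" ham):

```r
# Toy labels: actual classes and a classifier's predictions.
actual    <- factor(c("F", "F", "T", "T", "F"))
predicted <- factor(c("F", "T", "T", "F", "F"))

# Type I error: fraction of ham ("F") wrongly flagged as spam ("T").
typeI  <- sum(predicted == "T" & actual == "F") / sum(actual == "F")

# Type II error: fraction of spam ("T") wrongly passed as ham ("F").
typeII <- sum(predicted == "F" & actual == "T") / sum(actual == "T")

c(typeI = typeI, typeII = typeII)  # here: typeI = 1/3, typeII = 1/2
```

The same arithmetic, applied to the rpart predictions on the test set, produces the error rates reported throughout this study.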


Baseline Model

To run the baseline rpart model with default rpart.control() parameters, a seed is set for consistent results. Then, test indices are created for both spam and ham emails and the training and test sets are split into 2/3 training and 1/3 test sets. Finally, the recursive partitioning and classification tree is fit with rpart().

In [5]:
#Set seed for consistent results
set.seed(418910)
#create indices
testSpamIdx = sample(numSpam, size = floor(numSpam/3))
testHamIdx = sample(numHam, size = floor(numHam/3))
#create training and test splits
testDF = 
  rbind( emailDFrp[ emailDFrp$isSpam == "T", ][testSpamIdx, ],
         emailDFrp[emailDFrp$isSpam == "F", ][testHamIdx, ] )
trainDF =
  rbind( emailDFrp[emailDFrp$isSpam == "T", ][-testSpamIdx, ], 
         emailDFrp[emailDFrp$isSpam == "F", ][-testHamIdx, ])
#fit the classification tree
rpartFit = rpart(isSpam ~ ., data = trainDF, method = "class")
In [6]:
# to assess how well the classifier does at predicting if the test messages are spam  or ham
predictions = predict(rpartFit, 
       newdata = testDF[, names(testDF) != "isSpam"],
       type = "class")

#type I error 
predsForHam = predictions[ testDF$isSpam == "F" ]
summary(predsForHam)
print("Type I Error for Baseline Model is:") 
sum(predsForHam == "T") / length(predsForHam)

#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Baseline Model is:")
sum(predsForSpam == "F") / length(predsForSpam)
   F    T 
2167  150 
[1] "Type I Error for Baseline Model is:"
0.0647388864911524
[1] "Type II Error for Baseline Model is:"
0.18648310387985

The proportion of ham messages that have been misclassified as spam (Type I) is 0.065 or 6.5%. The proportion of spam messages misclassified as ham (Type II) is 0.186 or 18.6%.

This study will explore how these baseline error rates change when various parameters in rpart.control() are manipulated.

Below is the tree produced for the spam data with the default control parameters set.

In [7]:
#no control params set
#fancyRpartPlot(rpartFit, extra = 1)
rpart.plot(rpartFit,extra = 1, main = "Tree to Predict Spam with Default Control Parameters")

Figure 1: Tree to Predict Spam with Default Control Parameters. Baseline tree fitted using `rpart()`.

Next, rpart.control() will be examined to assess which control parameters could be changed and how they affect the spam identification trees and the error rates.

In [8]:
#check control parameters defaults
args(rpart.control)
function (minsplit = 20L, minbucket = round(minsplit/3), cp = 0.01, 
    maxcompete = 4L, maxsurrogate = 5L, usesurrogate = 2L, xval = 10L, 
    surrogatestyle = 0L, maxdepth = 30L, ...) 
NULL


Complexity Parameter

The complexity parameter, or cp, is a threshold on improvement: any split that does not decrease the overall lack of fit by a factor of cp is not attempted. The main role of this parameter is to save computing time by pruning off splits that are not worth the effort. To test multiple values of cp, a list of values is created in complexityVals, as was done in Nolan and Lang's work.

In [9]:
#list of values of CP
set.seed(418910)
complexityVals = c(seq(0.00001, 0.0001, length=19),  
                   seq(0.0001, 0.001, length=19),  
                   seq(0.001, 0.005, length=9),  
                   seq(0.005, 0.01, length=9)) 
#function to apply each CP to rpart()
fits = lapply(complexityVals, function(x) {  
    rpartObj = rpart(isSpam ~., data = trainDF,  
                     method="class",  
                     control = rpart.control(cp=x))  
    predict(rpartObj,  newdata = testDF[, names(testDF) != "isSpam"],  
            type = "class")  
})
In [10]:
#type I and II error evaluation for all CP values
spam = testDF$isSpam == "T"  
numSpam = sum(spam)  
numHam = sum(!spam)  
errs = sapply(fits, function(preds) {  
    typeI = sum(preds[ !spam] == "T")/ numHam  
    typeII = sum(preds[ spam] == "F")/ numSpam  
    c(typeI = typeI, typeII = typeII)  
})

Next, the Type I and II errors are plotted, with a vertical dashed line denoting the lowest value of Type I error.

In [11]:
library(RColorBrewer)
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ complexityVals, type="l", col=cols[2], 
     lwd = 2, ylim = c(0,0.2), xlim = c(0,0.01), 
     ylab="Error", xlab="Complexity Parameter Values")
points(errs[2,] ~ complexityVals, type="l", col=cols[1], lwd = 2)

text(x =c(0.003, 0.003), y = c(0.12, 0.03), 
     labels=c("Type II Error", "Type I Error"))

minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = complexityVals[minI], col ="grey", lty =3, lwd=2)

text(0.0007, errs[1, minI]+0.01, 
     formatC(errs[1, minI], digits = 2))
text(0.0007, errs[2, minI]+0.01, 
     formatC(errs[2, minI], digits = 3))

print(complexityVals[minI])
[1] 0.0015

Figure 2: Plot of the Type I and II errors for predicting spam as a function of the size of the complexity parameter in the `rpart.control()` function.

The CP value that produces the lowest type I error rate (3.4%) is 0.0015. The type II error at this CP value is 17.9%. The type I error at cp = 0.0015 is lower than the baseline model's, but the type II error is higher.

In [12]:
CPFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(cp = 0.0015))
rpart.plot(CPFit,extra = 1, main = "Tree to Predict Spam with Optimized CP Parameter")
Warning message:
"labs do not fit even at cex 0.15, there may be some overplotting"

Figure 3: Tree to Predict Spam with Optimized CP Parameter set at 0.0015. This setting produces a very large tree compared to the default setting of 0.01.


Minimum Split

The minimum split is the minimum number of observations that must exist in a node in order for a split to be attempted. As with the CP, multiple split values will be tested. The function is modified to try split minimums between 1 and 300.

In [13]:
splits = c(1:300)

splitFits = lapply(splits, function(x) {
         rpartObj = rpart(isSpam ~ ., data = trainDF,
                          method="class", 
                          control = rpart.control(minsplit=x) )
           
         predict(rpartObj, 
                 newdata = testDF[ , names(testDF) != "isSpam"],
                 type = "class")
        })

spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(splitFits, function(preds) {
                      typeI = sum(preds[ !spam ] == "T") / numHam
                      typeII = sum(preds[ spam ] == "F") / numSpam
                      c(typeI = typeI, typeII = typeII)
                     })

Type I and II errors are plotted, with a vertical dashed line denoting the lowest value of Type I error.

In [14]:
library(RColorBrewer)
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ splits, type="l", col=cols[2], 
     lwd = 2, ylim = c(0,0.3), xlim = c(0,300), 
     ylab="Error", xlab="Minimum Split Parameter Values")
points(errs[2,] ~ splits, type="l", col=cols[1], lwd = 2)

text(x =c(50, 50), y = c(0.16, 0.04), 
     labels=c("Type II Error", "Type I Error"))

minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = splits[minI], col ="grey", lty =3, lwd=2)

text(120, errs[1, minI]+0.01, 
     formatC(errs[1, minI], digits = 2))
text(120, errs[2, minI]+0.01, 
     formatC(errs[2, minI], digits = 3))

print(splits[minI])
[1] 131

Figure 4: Plot of the Type I and II errors for predicting spam as a function of the size of the minsplit parameter in the `rpart.control()` function.

The minsplit value that produces the lowest type I error rate (4.4%) is 131. The type II error at this minsplit value is 27.7%. The type I error at 131 is lower than the baseline model's, but the type II error is much higher.

In [15]:
SplitFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(minsplit = 131))
rpart.plot(SplitFit,extra = 1, main = "Tree to Predict Spam with Optimized Minimum Split Parameter")

Figure 5: Tree to Predict Spam with Optimized Minimum Split Parameter set at 131.


Minimum Buckets

The minimum bucket parameter controls the minimum number of observations in any terminal leaf node. This parameter seems to be related to minimum split. The function is modified to try bucket minimums between 1 and 150.

In [16]:
#modification for minimum buckets
buckets = c(1:150)

bucketFits = lapply(buckets, function(x) {
         rpartObj = rpart(isSpam ~ ., data = trainDF,
                          method="class", 
                          control = rpart.control(minbucket=x) )
           
         predict(rpartObj, 
                 newdata = testDF[ , names(testDF) != "isSpam"],
                 type = "class")
        })

spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(bucketFits, function(preds) {
                      typeI = sum(preds[ !spam ] == "T") / numHam
                      typeII = sum(preds[ spam ] == "F") / numSpam
                      c(typeI = typeI, typeII = typeII)
                     })

Type I and II errors are plotted, with a vertical dashed line denoting the lowest value of Type I error.

In [17]:
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ buckets, type="l", col=cols[2], 
     lwd = 2, ylim = c(0,0.45), xlim = c(0,150), 
     ylab="Error", xlab="minbucket parameter values")
points(errs[2,] ~ buckets, type="l", col=cols[1], lwd = 2)

text(x =c(10, 11), y = c(0.15, 0.035), 
     labels=c("Type II Error", "Type I Error"))

minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = buckets[minI], col ="grey", lty =3, lwd=2)

text(35, errs[1, minI]+0.01, 
     formatC(errs[1, minI], digits = 2))
text(35, errs[2, minI]+0.01, 
     formatC(errs[2, minI], digits = 3))
title("Minbucket Error Rates")

print(buckets[minI])
[1] 44

Figure 6: Plot of the Type I and II errors for predicting spam as a function of the minimum bucket parameter in the `rpart.control()` function.

The minimum bucket errors are exactly the same as the minimum split errors, except that the minimizing bucket value is smaller. Consulting the rpart.control() documentation reveals that the two parameters are related: by default, the minimum bucket is one third of the minimum split (rounded). Because of this, in the final model it should only be necessary to specify one of the two parameters, either minsplit or minbucket.
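This relationship can be checked directly against the rpart.control() defaults (a small sketch):

```r
library(rpart)

# When only minsplit is supplied, minbucket defaults to round(minsplit/3).
ctrl <- rpart.control(minsplit = 131)
ctrl$minbucket  # round(131/3) = 44, matching the optimum found above
```

This explains why fixing one of the two parameters largely determines the other.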

In [18]:
#classification tree
bucketFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(minbucket = 44))
rpart.plot(bucketFit,extra = 1, main = "Tree to Predict Spam with Optimized Minimum Bucket Parameter")

Figure 7: Tree to Predict Spam with Optimized Minimum Bucket Parameter set at 44. Because minsplit and minbucket are related, the trees for these parameters are identical.


Maximum Compete

The maxcompete parameter sets the number of competitor splits retained in the output. The function is modified to try compete maximums between 1 and 10.

In [19]:
#modification for max compete
compete = c(1:10)

comfits = lapply(compete, function(x) {
         rpartObj = rpart(isSpam ~ ., data = trainDF,
                          method="class", 
                          control = rpart.control(maxcompete=x) )
           
         predict(rpartObj, 
                 newdata = testDF[ , names(testDF) != "isSpam"],
                 type = "class")
        })

spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(comfits, function(preds) {
                      typeI = sum(preds[ !spam ] == "T") / numHam
                      typeII = sum(preds[ spam ] == "F") / numSpam
                      c(typeI = typeI, typeII = typeII)
                     })

Type I and II errors are plotted, with a vertical dashed line denoting the lowest value of Type I error.

In [20]:
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ compete, type="l", col=cols[2], 
     lwd = 2, ylim = c(0,0.7), xlim = c(0,10), 
     ylab="Error", xlab="Maxcompete Parameter Values")
points(errs[2,] ~ compete, type="l", col=cols[1], lwd = 2)

text(x =c(2.2, 2.2), y = c(0.65, 0.15), 
     labels=c("Type II Error", "Type I Error"))

minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = compete[minI], col ="grey", lty =3, lwd=2)

text(2.3, errs[1, minI]+0.01, 
     formatC(errs[1, minI], digits = 2))
text(2.3, errs[2, minI]+0.01, 
     formatC(errs[2, minI], digits = 3))
title("Maxcompete Error Rates")

print(compete[minI])
[1] 1

Figure 8: Plot of the Type I and II errors for predicting spam as a function of the size of the maximum compete parameter in the `rpart.control()` function.

For values 1 through 10, the error rates do not change. This makes sense: maxcompete only controls how many competing splits are reported in the fit's summary output; it does not change which splits are chosen. For this dataset, maxcompete can therefore be left at its default value of 4, and the tree at any value between 1 and 10 should look exactly the same as the default tree.

In [21]:
#tree graph
competeFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(maxcompete = 5))
rpart.plot(competeFit, extra = 1, main = "Tree to Predict Spam with Maximum Compete Parameter Set at 5")

Figure 9: Tree to Predict Spam with Optimized Maximum Compete Parameter set at 5.

As expected, the tree with maxcompete set at 5 is exactly the same as it was at the default of 4.


Maximum Depth

The maxdepth parameter sets the maximum depth of any node of the final tree, with the root node counted as depth 0. This is a useful parameter that can be systematically incremented to aid in understanding variable importance in a model. Here, the function is modified to try maximum tree depths between 1 and 30.

In [22]:
#modification for max depth
depth = c(1:30)

fits = lapply(depth, function(x) {
         rpartObj = rpart(isSpam ~ ., data = trainDF,
                          method="class", 
                          control = rpart.control(maxdepth=x) )
           
         predict(rpartObj, 
                 newdata = testDF[ , names(testDF) != "isSpam"],
                 type = "class")
        })

spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(fits, function(preds) {
                      typeI = sum(preds[ !spam ] == "T") / numHam
                      typeII = sum(preds[ spam ] == "F") / numSpam
                      c(typeI = typeI, typeII = typeII)
                     })

Type I and II errors are plotted, with a vertical dashed line denoting the lowest value of Type I error.

In [23]:
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ depth, type="l", col=cols[2], 
     lwd = 2, ylim = c(0,0.7), xlim = c(0,15), 
     ylab="Error", xlab="maxdepth parameter values")
points(errs[2,] ~ depth, type="l", col=cols[1], lwd = 2)

text(x =c(2.2, 2.2), y = c(0.65, 0.15), 
     labels=c("Type II Error", "Type I Error"))

minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = depth[minI], col ="grey", lty =3, lwd=2)

text(2.3, errs[1, minI]+0.01, 
     formatC(errs[1, minI], digits = 2))
text(2.3, errs[2, minI]+0.01, 
     formatC(errs[2, minI], digits = 3))
title("Maxdepth Error Rates")

print(depth[minI])
[1] 3

Figure 10: Plot of the Type I and II errors for predicting spam as a function of the size of the maximum depth parameter in the `rpart.control()` function.

In [24]:
#graph of tree
depthFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(maxdepth = 3))
rpart.plot(depthFit,extra = 1, main = "Tree to Predict Spam with Optimized Maximum Depth Parameter")

Figure 11: Tree to Predict Spam with Optimized Maximum Depth Parameter set at 3. The optimized tree has a depth of 3 while the baseline tree has a depth of 8.


Optimized Model

Finally, with the experimentation on the control parameters concluded, a better model can hopefully be built from what has been learned. Recall that the Type I Error for the baseline model with default control parameters was 0.065 (6.5%) and the Type II Error was 0.186 (18.6%). A new rpart model will be run with the values that produced the lowest Type I Error in the individual control parameter testing. Maxcompete and minbucket can be omitted because they do not impact the outcome.

In [25]:
#fit for optimized model
rpartFit1 = rpart(isSpam ~ ., data = trainDF, method = "class", 
                 control = rpart.control(cp = 0.0015,
                                         minsplit=131,
                                         #minbucket=44,
                                         maxdepth=3,
                                         #maxcompete=1,
                                        ))
In [26]:
predictions = predict(rpartFit1, 
       newdata = testDF[, names(testDF) != "isSpam"],
       type = "class")

#type I error 
predsForHam = predictions[ testDF$isSpam == "F" ]
print("Type I Error for Optimized Model is:") 
sum(predsForHam == "T") / length(predsForHam)

#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Optimized Model is:") 
sum(predsForSpam == "F") / length(predsForSpam)
[1] "Type I Error for Optimized Model is:"
0.0233059991368148
[1] "Type II Error for Optimized Model is:"
0.481852315394243

This optimized model achieves a very low type I error of 0.023, but the type II error has increased greatly from the baseline model, to 0.482. As a spam filter, only about 2% of legitimate messages (ham) would be incorrectly classified as spam, but nearly 50% of spam messages would end up in the inbox instead of being filtered out.

In [27]:
#tree for optimized model
rpart.plot(rpartFit1,extra = 1, main = "Tree to Predict Spam with Optimized Control Parameters")

Figure 12: Tree to Predict Spam with All Optimized Control Parameters.

The above model yields a low type I error, but it is not ideal because of the high type II error. Optimizing the parameters individually, it seems, does not necessarily yield the optimal combination. In an effort to bring the type II error down, some different parameter values will be tried. Re-assessing the error plots for the various parameters, new values were chosen where the type II error is lower but does not drive the type I error up too much. In this modified optimized model, the cp remained at 0.0015, but minsplit and maxdepth were altered slightly.

Modified Optimized model

In [28]:
#modified optimized model
rpartFit2 = rpart(isSpam ~ ., data = trainDF, method = "class", 
                 control = rpart.control(cp = 0.0015,
                                         minsplit=125,#change from 131
                                         maxdepth=10))#change from 3
In [29]:
predictions = predict(rpartFit2, 
       newdata = testDF[, names(testDF) != "isSpam"],
       type = "class")

#type I error 
predsForHam = predictions[ testDF$isSpam == "F" ]
print("Type I Error for Modified Optimized Model is:") 
sum(predsForHam == "T") / length(predsForHam)

#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Modified Optimized Model is:") 
sum(predsForSpam == "F") / length(predsForSpam)
[1] "Type I Error for Modified Optimized Model is:"
0.0401381096245145
[1] "Type II Error for Modified Optimized Model is:"
0.249061326658323

For this final model, the type I error increases a little from the previous optimized model but remains low at 0.040. The type II error is reduced from 0.482 to 0.249. This means 4% of ham emails and 25% of spam emails would be misclassified. This model is a better compromise: it yields an improved type I error over the baseline, with only a modest increase in type II error.

In [30]:
#tree plot for modified optimized model
rpart.plot(rpartFit2,extra = 1, main = "Tree to Predict Spam with Modified Optimized Control Parameters")

Figure 13: Tree to Predict Spam with Modified Optimized Parameters set.


Conclusion

This study examined the role that the control parameters play in building decision trees with the rpart package. Using a dataset of email messages classified as spam or ham, multiple values for the individual rpart.control() parameters were assessed by their error rates. A new optimized model was then created that improved the type I error over the baseline model with default control parameters, but worsened the type II error rate. Further tuning of the control parameters produced a model with a more acceptable compromise between the type I and type II error rates.

Model                Type I   Type II
Baseline             0.065    0.186
Optimized            0.023    0.481
Modified Optimized   0.040    0.249

Further study could be done into the role that some of the other control parameters play, including maxsurrogate, usesurrogate, xval, and surrogatestyle. Additionally, study of the parms and cost arguments to rpart() could be useful.

References

Nolan, D., and Temple Lang, D. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. Chapman & Hall/CRC.