It is hard to imagine a productive work and social life without email. Checking an email inbox is an essential part of many people's daily routine. But this wonderful advancement in technology and productivity is not without its drawbacks, namely spam: the unwanted, obtrusive, and sometimes dangerous advertising emails that can overrun an inbox. Though it is usually easy to spot a spam email without opening it, by its subject or sender name, most email services rely on an automated procedure to identify spam and sort it into a spam folder.
This study will examine a dataset of over 9000 email messages and attempt to distinguish an authentic email, or ham, from an inauthentic email, or spam. To accomplish this, recursive partitioning and classification trees will be explored using the R package rpart. Recursive partitioning is a set of nonparametric techniques for classification and prediction. It is a useful method that can be easily visualized through the "trees" it produces, which display the predicted class or predicted value. The exploration will begin with the methods discussed in Nolan and Lang's Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving.
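As a minimal illustration of the method (using the iris data shipped with R, not the email data), rpart grows a tree of binary splits and attaches a predicted class to each leaf:

```r
library(rpart)

# Fit a classification tree on the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class")

# Each row of $frame is a node; leaves carry the predicted class
print(fit)

# Predict the class of a single observation
predict(fit, newdata = iris[1, ], type = "class")
```

The printed tree reads top-down: each line shows the split condition, the node size, and the majority class, which is how the later spam trees will be interpreted.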
Specifically, this study will explore how the individual rpart.control() parameters affect the resulting classification trees and their Type I and Type II error rates.
The data comes from the open-source Apache SpamAssassin Project. The dataset contains 9348 email messages that have already been extensively cleaned and prepared for analysis using Nolan and Lang's method. Below is the description of the variables in the dataset.
| Variable | Description |
|---|---|
| isRe | TRUE if Re: appears at the start of the subject. |
| numLines | Number of lines in the body of the message. |
| bodyCharCt | Number of characters in the body of the message. |
| underscore | TRUE if email address in the From field of the header contains an underscore. |
| subExcCt | Number of exclamation marks in the subject. |
| subQuesCt | Number of question marks in the subject. |
| numAtt | Number of attachments in the message. |
| priority | TRUE if a Priority key is present in the header. |
| numRec | Number of recipients of the message, including CCs. |
| perCaps | Percentage of capitals among all letters in the message body, excluding attachments. |
| isInReplyTo | TRUE if the In-Reply-To key is present in the header. |
| sortedRec | TRUE if the recipients’ email addresses are sorted. |
| subPunc | TRUE if words in the subject have punctuation or numbers embedded in them, e.g., w!se. |
| hour | Hour of the day in the Date field. |
| multipartText | TRUE if the MIME type is multipart/text. |
| hasImages | TRUE if the message contains images. |
| isPGPsigned | TRUE if the message contains a PGP signature. |
| perHTML | Percentage of characters in HTML tags in the message body in comparison to all characters. |
| subSpamWords | TRUE if the subject contains one of the words in a spam word vector. |
| subBlanks | Percentage of blanks in the subject. |
| noHost | TRUE if there is no hostname in the Message-Id key in the header. |
| numEnd | TRUE if the email sender’s address (before the @) ends in a number. |
| isYelling | TRUE if the subject is all capital letters. |
| forwards | Number of forward symbols in a line of the body, e.g., >>> xxx contains 3 forwards. |
| isOrigMsg | TRUE if the message body contains the phrase original message. |
| isDear | TRUE if the message body contains the word dear. |
| isWrote | TRUE if the message contains the phrase wrote:. |
| avgWordLen | The average length of the words in a message. |
| numDlr | Number of dollar signs in the message body. |
#load libraries and data
library(rpart)
library(rpart.plot)
library(rattle)
library(IRdisplay)
load("spamData.rda")
#check data
head(emailDFrp)
#checking summary stats to get spam message numbers
summary(emailDFrp$isSpam)
#total number of emails, spams and non-spams(ham)
numEmail = 9348
numSpam = 2397
numHam = 6951
After loading the data and R packages, the task is to run a baseline classification tree using the rpart package. From there, alternate settings for rpart.control() will be systematically explored to assess their effect on the model and its error rates.
After running a recursive partitioning model, the predict() function will be used to assess how well the classifier does at predicting whether the test messages are legitimate (ham) or spam. The fitted rpart predictions are then compared to the actual message classifications, and the various models are evaluated by their Type I and Type II error rates. It is worth reviewing these terms.
Type I error is the rejection of a true null hypothesis as the result of a test procedure. In this case, it would be classifying a message as spam, when it is actually a legitimate email (ham), or losing a real email to the spam folder. A type I error is equivalent to a false positive.
Type II error is the failure to reject a false null hypothesis as the result of a test procedure. In this case, it would be classifying a spam message as a legitimate email (ham), or sending spam to your inbox. A type II error is equivalent to a false negative.
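As a minimal illustration (with hypothetical predicted and actual labels, not the study data), both error rates can be computed directly from logical comparisons, exactly as the study does later:

```r
# Hypothetical labels: "T" = spam, "F" = ham
actual    <- c("F", "F", "F", "F", "T", "T", "T", "F", "T", "F")
predicted <- c("F", "T", "F", "F", "T", "F", "T", "F", "T", "F")

# Type I error: ham messages wrongly flagged as spam (false positives)
typeI  <- sum(predicted == "T" & actual == "F") / sum(actual == "F")

# Type II error: spam messages let through as ham (false negatives)
typeII <- sum(predicted == "F" & actual == "T") / sum(actual == "T")

typeI   # 1 of 6 ham misclassified
typeII  # 1 of 4 spam misclassified
```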
To run the baseline rpart model with default rpart.control() parameters, a seed is set for consistent results. Then, test indices are created for both spam and ham emails and the training and test sets are split into 2/3 training and 1/3 test sets. Finally, the recursive partitioning and classification tree is fit with rpart().
#Set seed for consistent results
set.seed(418910)
#create indices
testSpamIdx = sample(numSpam, size = floor(numSpam/3))
testHamIdx = sample(numHam, size = floor(numHam/3))
#create training and test splits
testDF =
rbind( emailDFrp[ emailDFrp$isSpam == "T", ][testSpamIdx, ],
emailDFrp[emailDFrp$isSpam == "F", ][testHamIdx, ] )
trainDF =
rbind( emailDFrp[emailDFrp$isSpam == "T", ][-testSpamIdx, ],
emailDFrp[emailDFrp$isSpam == "F", ][-testHamIdx, ])
#fit the classification tree
rpartFit = rpart(isSpam ~ ., data = trainDF, method = "class")
# to assess how well the classifier does at predicting if the test messages are spam or ham
predictions = predict(rpartFit,
newdata = testDF[, names(testDF) != "isSpam"],
type = "class")
#type I error
predsForHam = predictions[ testDF$isSpam == "F" ]
summary(predsForHam)
print("Type I Error for Baseline Model is:")
sum(predsForHam == "T") / length(predsForHam)
#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Baseline Model is:")
sum(predsForSpam == "F") / length(predsForSpam)
Next, the baseline tree is plotted before any of the rpart.control() parameters are manipulated.
#no control params set
#fancyRpartPlot(rpartFit, extra = 1)
rpart.plot(rpartFit,extra = 1, main = "Tree to Predict Spam with Default Control Parameters")
Next, rpart.control will be examined to assess which control parameters could be changed and how they affect the spam identification trees and the error rates.
#check control parameters defaults
args(rpart.control)
The complexity parameter, or cp, is a threshold for splitting: any split that does not decrease the overall lack of fit by a factor of cp is not attempted. The main role of this parameter is to save computing time by pruning off splits that are not worth the effort.
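A fitted rpart object stores the cross-validated error for each candidate cp value in its cptable, and prune() cuts an existing tree back to a chosen threshold without refitting. A brief sketch (using the kyphosis data shipped with rpart, not the email data):

```r
library(rpart)

# Fit with a deliberately permissive cp so many splits are retained
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = rpart.control(cp = 0.001))

# cptable lists each candidate cp with its cross-validated error
printcp(fit)

# Prune back: splits that improve fit by less than cp = 0.05 are removed
pruned <- prune(fit, cp = 0.05)

# The pruned tree has no more nodes (rows of $frame) than the original
nrow(pruned$frame) <= nrow(fit$frame)
```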
To test multiple values of cp, a list of values is created in complexityVals, as was done in Nolan and Lang's work.
#list of values of CP
set.seed(418910)
complexityVals = c(seq(0.00001, 0.0001, length=19),
seq(0.0001, 0.001, length=19),
seq(0.001, 0.005, length=9),
seq(0.005, 0.01, length=9))
#function to apply each CP to rpart()
fits = lapply(complexityVals, function(x) {
rpartObj = rpart(isSpam ~., data = trainDF,
method="class",
control = rpart.control(cp=x))
predict(rpartObj, newdata = testDF[, names(testDF) != "isSpam"],
type = "class")
})
#type I and II error evaluation for all CP values
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(fits, function(preds) {
typeI = sum(preds[ !spam] == "T")/ numHam
typeII = sum(preds[ spam] == "F")/ numSpam
c(typeI = typeI, typeII = typeII)
})
library(RColorBrewer)
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ complexityVals, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.2), xlim = c(0,0.01),
ylab="Error", xlab="Complexity Parameter Values")
points(errs[2,] ~ complexityVals, type="l", col=cols[1], lwd = 2)
text(x =c(0.003, 0.003), y = c(0.12, 0.03),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = complexityVals[minI], col ="grey", lty =3, lwd=2)
text(0.0007, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(0.0007, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
print(complexityVals[minI])
The cp value that produces the lowest Type I error rate (3.4%) is 0.0015. The Type II error at this cp value is 17.9%. The Type I error at cp = 0.0015 is lower than the baseline model's, but the Type II error is higher.
CPFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(cp = 0.0015))
rpart.plot(CPFit,extra = 1, main = "Tree to Predict Spam with Optimized CP Parameter")
The minimum split is the minimum number of observations that must exist in a node in order for a split to be attempted. As with the CP, multiple split values will be tested. The function is modified to try split minimums between 1 and 300.
splits = c(1:300)
splitFits = lapply(splits, function(x) {
rpartObj = rpart(isSpam ~ ., data = trainDF,
method="class",
control = rpart.control(minsplit=x) )
predict(rpartObj,
newdata = testDF[ , names(testDF) != "isSpam"],
type = "class")
})
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(splitFits, function(preds) {
typeI = sum(preds[ !spam ] == "T") / numHam
typeII = sum(preds[ spam ] == "F") / numSpam
c(typeI = typeI, typeII = typeII)
})
library(RColorBrewer)
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ splits, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.3), xlim = c(0,300),
ylab="Error", xlab="Minimum Split Parameter Values")
points(errs[2,] ~ splits, type="l", col=cols[1], lwd = 2)
text(x =c(50, 50), y = c(0.16, 0.04),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = splits[minI], col ="grey", lty =3, lwd=2)
text(120, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(120, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
print(splits[minI])
The minsplit value that produces the lowest Type I error rate (4.4%) is 131. The Type II error at this minsplit value is 27.7%. The Type I error at minsplit = 131 is lower than the baseline model's, but the Type II error is much higher.
SplitFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(minsplit = 131))
rpart.plot(SplitFit,extra = 1, main = "Tree to Predict Spam with Optimized Minimum Split Parameter")
The minimum bucket parameter controls the minimum number of observations in any terminal leaf node. This parameter seems to be related to minimum split. The function is modified to try bucket minimums between 1 and 150.
#modification for minimum buckets
buckets = c(1:150)
bucketFits = lapply(buckets, function(x) {
rpartObj = rpart(isSpam ~ ., data = trainDF,
method="class",
control = rpart.control(minbucket=x) )
predict(rpartObj,
newdata = testDF[ , names(testDF) != "isSpam"],
type = "class")
})
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(bucketFits, function(preds) {
typeI = sum(preds[ !spam ] == "T") / numHam
typeII = sum(preds[ spam ] == "F") / numSpam
c(typeI = typeI, typeII = typeII)
})
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ buckets, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.45), xlim = c(0,150),
ylab="Error", xlab="minbucket parameter values")
points(errs[2,] ~ buckets, type="l", col=cols[1], lwd = 2)
text(x =c(10, 11), y = c(0.15, 0.035),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = buckets[minI], col ="grey", lty =3, lwd=2)
text(35, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(35, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
title("Minbucket Error Rates")
print(buckets[minI])
The rpart.control documentation reveals that minbucket and minsplit are related: by default, the minimum bucket is one third of the minimum split. Because of this, in the final model it should only be necessary to specify one of the two parameters, either minsplit or minbucket.
#classification tree
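This relationship can be checked directly: per the rpart.control documentation, when only one of the two parameters is supplied, minbucket defaults to round(minsplit/3) and minsplit to minbucket * 3.

```r
library(rpart)

# Supplying only minsplit: minbucket defaults to round(minsplit / 3)
ctrl <- rpart.control(minsplit = 131)
ctrl$minbucket  # round(131 / 3) = 44, matching the value used below

# Supplying only minbucket: minsplit defaults to 3 * minbucket
ctrl2 <- rpart.control(minbucket = 44)
ctrl2$minsplit  # 132
```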
bucketFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(minbucket = 44))
rpart.plot(bucketFit,extra = 1, main = "Tree to Predict Spam with Optimized Minimum Bucket Parameter")
The maxcompete parameter sets the number of competitor splits retained in the output. The function is modified to try compete maximums between 1 and 10.
#modification for max compete
compete = c(1:10)
comfits = lapply(compete, function(x) {
rpartObj = rpart(isSpam ~ ., data = trainDF,
method="class",
control = rpart.control(maxcompete=x) )
predict(rpartObj,
newdata = testDF[ , names(testDF) != "isSpam"],
type = "class")
})
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(comfits, function(preds) {
typeI = sum(preds[ !spam ] == "T") / numHam
typeII = sum(preds[ spam ] == "F") / numSpam
c(typeI = typeI, typeII = typeII)
})
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ compete, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.7), xlim = c(0,10),
ylab="Error", xlab="Maxcompete Parameter Values")
points(errs[2,] ~ compete, type="l", col=cols[1], lwd = 2)
text(x =c(2.2, 2.2), y = c(0.65, 0.15),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = compete[minI], col ="grey", lty =3, lwd=2)
text(2.3, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(2.3, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
title("Maxcompete Error Rates")
print(compete[minI])
#tree graph
competeFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(maxcompete = 5))
rpart.plot(competeFit,extra = 1, main = "Tree to Predict Spam with Optimized Maxcompete Parameter")
The maxdepth parameter sets the maximum depth of any node of the final tree, with the root node counted as depth 0. This is a useful parameter that can be systematically incremented to aid in understanding variable importance in a model. Here, the function is modified to try maximum tree depths between 1 and 30.
#modification for max depth
depth = c(1:30)
fits = lapply(depth, function(x) {
rpartObj = rpart(isSpam ~ ., data = trainDF,
method="class",
control = rpart.control(maxdepth=x) )
predict(rpartObj,
newdata = testDF[ , names(testDF) != "isSpam"],
type = "class")
})
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(fits, function(preds) {
typeI = sum(preds[ !spam ] == "T") / numHam
typeII = sum(preds[ spam ] == "F") / numSpam
c(typeI = typeI, typeII = typeII)
})
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ depth, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.7), xlim = c(0,15),
ylab="Error", xlab="maxdepth parameter values")
points(errs[2,] ~ depth, type="l", col=cols[1], lwd = 2)
text(x =c(2.2, 2.2), y = c(0.65, 0.15),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = depth[minI], col ="grey", lty =3, lwd=2)
text(2.3, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(2.3, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
title("Maxdepth Error Rates")
print(depth[minI])
#graph of tree
depthFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(maxdepth = 3))
rpart.plot(depthFit,extra = 1, main = "Tree to Predict Spam with Optimized Maximum Depth Parameter")
Next, an rpart model will be run with the values that produced the lowest Type I error in the individual control parameter testing. maxcompete and minbucket can be removed because they do not impact the outcome.
#fit for optimized model
rpartFit1 = rpart(isSpam ~ ., data = trainDF, method = "class",
control = rpart.control(cp = 0.0015,
minsplit = 131,
#minbucket = 44, omitted: no impact on outcome
#maxcompete = 1, omitted: no impact on outcome
maxdepth = 3))
predictions = predict(rpartFit1,
newdata = testDF[, names(testDF) != "isSpam"],
type = "class")
#type I error
predsForHam = predictions[ testDF$isSpam == "F" ]
print("Type I Error for Optimized Model is:")
sum(predsForHam == "T") / length(predsForHam)
#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Optimized Model is:")
sum(predsForSpam == "F") / length(predsForSpam)
#tree for optimized model
rpart.plot(rpartFit1,extra = 1, main = "Tree to Predict Spam with Optimized Control Parameters")
In the final model, cp remained at 0.0015, but minsplit and maxdepth were altered slightly to improve the Type II error rate.
#modified optimized model
rpartFit2 = rpart(isSpam ~ ., data = trainDF, method = "class",
control = rpart.control(cp = 0.0015,
minsplit=125,#change from 131
maxdepth=10))#change from 3
predictions = predict(rpartFit2,
newdata = testDF[, names(testDF) != "isSpam"],
type = "class")
#type I error
predsForHam = predictions[ testDF$isSpam == "F" ]
print("Type I Error for Modified Optimized Model is:")
sum(predsForHam == "T") / length(predsForHam)
#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Modified Optimized Model is:")
sum(predsForSpam == "F") / length(predsForSpam)
#tree plot for modified optimized model
rpart.plot(rpartFit2,extra = 1, main = "Tree to Predict Spam with Modified Optimized Control Parameters")
This study examined the role that control parameters play in building decision trees with the rpart package. Using a dataset of email messages classified as spam or ham, multiple values for the individual rpart.control() parameters were assessed by their error rates. A new optimized model improved the Type I error over the baseline model with default control parameters, but worsened the Type II error rate. Further tuning of the control parameters produced a model with a more acceptable compromise between Type I and Type II error rates.
| Model | Type I | Type II |
|---|---|---|
| Baseline | 0.065 | 0.186 |
| Optimized | 0.023 | 0.481 |
| Modified Optimized | 0.040 | 0.249 |
Further study could be done into the role that some of the other control parameters play, including maxsurrogate, usesurrogate, xval, and surrogatestyle. Additionally, study of the parms and cost parameters could be useful.
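As one possible direction, the parms argument accepts a loss matrix that penalizes the two error types asymmetrically, which fits this study's goal of keeping the Type I error low. A hedged sketch on the kyphosis data shipped with rpart (not the email data), with illustrative cost values:

```r
library(rpart)

# Loss matrix: rows = true class, columns = predicted class, in factor
# level order. Off-diagonal entries are misclassification costs; here
# misclassifying the first class costs 5 times as much as the second.
lossMat <- matrix(c(0, 5,
                    1, 0), nrow = 2, byrow = TRUE)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", parms = list(loss = lossMat))

# The fitted tree now favors splits that avoid the expensive error type
print(fit)
```

In the spam setting, the analogous move would be to make "ham predicted as spam" the expensive cell, trading some Type II error for a lower Type I error.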