It is hard to imagine a productive work and social life without email. Checking an email inbox is an essential part of many people's daily routine. But this wonderful advancement in technology and productivity is not without its drawbacks, namely spam: the unwanted, obtrusive, and sometimes dangerous advertising emails that can overrun an inbox. Though it is usually easy to spot a spam email without opening it, by its subject or sender name, most email services rely on an automated procedure to identify spam and sort it into a spam folder.
This study will examine a dataset of over 9000 email messages and attempt to distinguish an authentic email, or ham, from an inauthentic email, or spam. To accomplish this, recursive partitioning and classification trees will be explored using the R package rpart. Recursive partitioning is a set of nonparametric techniques for classification and prediction. It is a useful method that can be easily visualized through the "trees" it produces, which display the predicted class or predicted value. The exploration will begin with the methods discussed in Nolan and Lang's Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving.
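As a minimal illustration of the method (using the iris data shipped with R, not the email data), rpart grows a tree of binary splits and attaches a predicted class to each leaf:

```r
library(rpart)

# Fit a classification tree on the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class")

# Each row of $frame is a node; leaves carry the predicted class
print(fit)

# Predict the class of a single observation
predict(fit, newdata = iris[1, ], type = "class")
```

The printed tree reads top-down: each line shows the split condition, the node size, and the majority class, which is how the later spam trees will be interpreted.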
Specifically, this study will explore how the individual rpart.control() parameters affect the resulting classification trees and their Type I and Type II error rates.
The data comes from the open-source Apache SpamAssassin Project. The dataset contains 9348 email messages that have already been extensively cleaned and prepared for analysis using Nolan and Lang's method. Below is the description of the variables in the dataset.
| Variable | Description |
|---|---|
| isRe | TRUE if Re: appears at the start of the subject. |
| numLines | Number of lines in the body of the message. |
| bodyCharCt | Number of characters in the body of the message. |
| underscore | TRUE if email address in the From field of the header contains an underscore. |
| subExcCt | Number of exclamation marks in the subject. |
| subQuesCt | Number of question marks in the subject. |
| numAtt | Number of attachments in the message. |
| priority | TRUE if a Priority key is present in the header. |
| numRec | Number of recipients of the message, including CCs. |
| perCaps | Percentage of capitals among all letters in the message body, excluding attachments. |
| isInReplyTo | TRUE if the In-Reply-To key is present in the header. |
| sortedRec | TRUE if the recipients’ email addresses are sorted. |
| subPunc | TRUE if words in the subject have punctuation or numbers embedded in them, e.g., w!se. |
| hour | Hour of the day in the Date field. |
| multipartText | TRUE if the MIME type is multipart/text. |
| hasImages | TRUE if the message contains images. |
| isPGPsigned | TRUE if the message contains a PGP signature. |
| perHTML | Percentage of characters in HTML tags in the message body in comparison to all characters. |
| subSpamWords | TRUE if the subject contains one of the words in a spam word vector. |
| subBlanks | Percentage of blanks in the subject. |
| noHost | TRUE if there is no hostname in the Message-Id key in the header. |
| numEnd | TRUE if the email sender’s address (before the @) ends in a number. |
| isYelling | TRUE if the subject is all capital letters. |
| forwards | Number of forward symbols in a line of the body, e.g., >>> xxx contains 3 forwards. |
| isOrigMsg | TRUE if the message body contains the phrase original message. |
| isDear | TRUE if the message body contains the word dear. |
| isWrote | TRUE if the message contains the phrase wrote:. |
| avgWordLen | The average length of the words in a message. |
| numDlr | Number of dollar signs in the message body. |
#load libraries and data
library(rpart)
library(rpart.plot)
library(rattle)
library(IRdisplay)
load("spamData.rda")
#check data
head(emailDFrp)
#checking summary stats to get spam message numbers
summary(emailDFrp$isSpam)
#total number of emails, spams and non-spams(ham)
numEmail = 9348
numSpam = 2397
numHam = 6951
After loading the data and R packages, the task is to run a baseline classification tree using the rpart package. From there, alternate settings for rpart.control() will be systematically explored to assess their effect on the model and its error rates.
After running a recursive partitioning model, the predict() function will be used to assess how well the classifier does at predicting whether the test messages are legitimate (ham) or spam. The fitted rpart predictions are then compared to the actual message classifications, and the various models are evaluated by their Type I and Type II error rates. It is worth reviewing these terms.
Type I error is the rejection of a true null hypothesis as the result of a test procedure. In this case, it would be classifying a message as spam, when it is actually a legitimate email (ham), or losing a real email to the spam folder. A type I error is equivalent to a false positive.
Type II error is the failure to reject a false null hypothesis as the result of a test procedure. In this case, it would be classifying a spam message as a legitimate email (ham), or sending spam to your inbox. A type II error is equivalent to a false negative.
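As a minimal illustration (with hypothetical predicted and actual labels, not the study data), both error rates can be computed directly from logical comparisons, exactly as the study does later:

```r
# Hypothetical labels: "T" = spam, "F" = ham
actual    <- c("F", "F", "F", "F", "T", "T", "T", "F", "T", "F")
predicted <- c("F", "T", "F", "F", "T", "F", "T", "F", "T", "F")

# Type I error: ham messages wrongly flagged as spam (false positives)
typeI  <- sum(predicted == "T" & actual == "F") / sum(actual == "F")

# Type II error: spam messages let through as ham (false negatives)
typeII <- sum(predicted == "F" & actual == "T") / sum(actual == "T")

typeI   # 1 of 6 ham misclassified
typeII  # 1 of 4 spam misclassified
```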
To run the baseline rpart model with default rpart.control() parameters, a seed is set for consistent results. Then, test indices are created for both spam and ham emails and the training and test sets are split into 2/3 training and 1/3 test sets. Finally, the recursive partitioning and classification tree is fit with rpart().
#Set seed for consistent results
set.seed(418910)
#create indices
testSpamIdx = sample(numSpam, size = floor(numSpam/3))
testHamIdx = sample(numHam, size = floor(numHam/3))
#create training and test splits
testDF =
rbind( emailDFrp[ emailDFrp$isSpam == "T", ][testSpamIdx, ],
emailDFrp[emailDFrp$isSpam == "F", ][testHamIdx, ] )
trainDF =
rbind( emailDFrp[emailDFrp$isSpam == "T", ][-testSpamIdx, ],
emailDFrp[emailDFrp$isSpam == "F", ][-testHamIdx, ])
#fit the classification tree
rpartFit = rpart(isSpam ~ ., data = trainDF, method = "class")
# to assess how well the classifier does at predicting if the test messages are spam or ham
predictions = predict(rpartFit,
newdata = testDF[, names(testDF) != "isSpam"],
type = "class")
#type I error
predsForHam = predictions[ testDF$isSpam == "F" ]
summary(predsForHam)
print("Type I Error for Baseline Model is:")
sum(predsForHam == "T") / length(predsForHam)
#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Baseline Model is:")
sum(predsForSpam == "F") / length(predsForSpam)
Next, the baseline tree is plotted before any of the rpart.control() parameters are manipulated.
#no control params set
#fancyRpartPlot(rpartFit, extra = 1)
rpart.plot(rpartFit,extra = 1, main = "Tree to Predict Spam with Default Control Parameters")
Next, rpart.control will be examined to assess which control parameters could be changed and how they affect the spam identification trees and the error rates.
#check control parameters defaults
args(rpart.control)
The complexity parameter, or cp, is a threshold for splitting: any split that does not decrease the overall lack of fit by a factor of cp is not attempted. The main role of this parameter is to save computing time by pruning off splits that are not worth the effort.
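A fitted rpart object stores the cross-validated error for each candidate cp value in its cptable, and prune() cuts an existing tree back to a chosen threshold without refitting. A brief sketch (using the kyphosis data shipped with rpart, not the email data):

```r
library(rpart)

# Fit with a deliberately permissive cp so many splits are retained
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = rpart.control(cp = 0.001))

# cptable lists each candidate cp with its cross-validated error
printcp(fit)

# Prune back: splits that improve fit by less than cp = 0.05 are removed
pruned <- prune(fit, cp = 0.05)

# The pruned tree has no more nodes (rows of $frame) than the original
nrow(pruned$frame) <= nrow(fit$frame)
```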
To test multiple values of cp, a list of values is created in complexityVals, as was done in Nolan and Lang's work.
#list of values of CP
set.seed(418910)
complexityVals = c(seq(0.00001, 0.0001, length=19),
seq(0.0001, 0.001, length=19),
seq(0.001, 0.005, length=9),
seq(0.005, 0.01, length=9))
#function to apply each CP to rpart()
fits = lapply(complexityVals, function(x) {
rpartObj = rpart(isSpam ~., data = trainDF,
method="class",
control = rpart.control(cp=x))
predict(rpartObj, newdata = testDF[, names(testDF) != "isSpam"],
type = "class")
})
#type I and II error evaluation for all CP values
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(fits, function(preds) {
typeI = sum(preds[ !spam] == "T")/ numHam
typeII = sum(preds[ spam] == "F")/ numSpam
c(typeI = typeI, typeII = typeII)
})
library(RColorBrewer)
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ complexityVals, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.2), xlim = c(0,0.01),
ylab="Error", xlab="Complexity Parameter Values")
points(errs[2,] ~ complexityVals, type="l", col=cols[1], lwd = 2)
text(x =c(0.003, 0.003), y = c(0.12, 0.03),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = complexityVals[minI], col ="grey", lty =3, lwd=2)
text(0.0007, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(0.0007, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
print(complexityVals[minI])
The cp value that produces the lowest Type I error rate (3.4%) is 0.0015. The Type II error at this cp value is 17.9%. The Type I error at cp = 0.0015 is lower than the baseline model's, but the Type II error is higher.
CPFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(cp = 0.0015))
rpart.plot(CPFit,extra = 1, main = "Tree to Predict Spam with Optimized CP Parameter")
The minimum split is the minimum number of observations that must exist in a node in order for a split to be attempted. As with the CP, multiple split values will be tested. The function is modified to try split minimums between 1 and 300.
splits = c(1:300)
splitFits = lapply(splits, function(x) {
rpartObj = rpart(isSpam ~ ., data = trainDF,
method="class",
control = rpart.control(minsplit=x) )
predict(rpartObj,
newdata = testDF[ , names(testDF) != "isSpam"],
type = "class")
})
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(splitFits, function(preds) {
typeI = sum(preds[ !spam ] == "T") / numHam
typeII = sum(preds[ spam ] == "F") / numSpam
c(typeI = typeI, typeII = typeII)
})
library(RColorBrewer)
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ splits, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.3), xlim = c(0,300),
ylab="Error", xlab="Minimum Split Parameter Values")
points(errs[2,] ~ splits, type="l", col=cols[1], lwd = 2)
text(x =c(50, 50), y = c(0.16, 0.04),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = splits[minI], col ="grey", lty =3, lwd=2)
text(120, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(120, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
print(splits[minI])
The minsplit value that produces the lowest Type I error rate (4.4%) is 131. The Type II error at this minsplit value is 27.7%. The Type I error at minsplit = 131 is lower than the baseline model's, but the Type II error is much higher.
SplitFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(minsplit = 131))
rpart.plot(SplitFit,extra = 1, main = "Tree to Predict Spam with Optimized Minimum Split Parameter")
The minimum bucket parameter controls the minimum number of observations in any terminal leaf node. This parameter seems to be related to minimum split. The function is modified to try bucket minimums between 1 and 150.
#modification for minimum buckets
buckets = c(1:150)
bucketFits = lapply(buckets, function(x) {
rpartObj = rpart(isSpam ~ ., data = trainDF,
method="class",
control = rpart.control(minbucket=x) )
predict(rpartObj,
newdata = testDF[ , names(testDF) != "isSpam"],
type = "class")
})
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(bucketFits, function(preds) {
typeI = sum(preds[ !spam ] == "T") / numHam
typeII = sum(preds[ spam ] == "F") / numSpam
c(typeI = typeI, typeII = typeII)
})
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ buckets, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.45), xlim = c(0,150),
ylab="Error", xlab="minbucket parameter values")
points(errs[2,] ~ buckets, type="l", col=cols[1], lwd = 2)
text(x =c(10, 11), y = c(0.15, 0.035),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = buckets[minI], col ="grey", lty =3, lwd=2)
text(35, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(35, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
title("Minbucket Error Rates")
print(buckets[minI])
The rpart.control documentation reveals that minbucket and minsplit are related: by default, the minimum bucket is one third of the minimum split. Because of this, in the final model it should only be necessary to specify one of the two parameters, either minsplit or minbucket.
#classification tree
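This relationship can be checked directly: per the rpart.control documentation, when only one of the two parameters is supplied, minbucket defaults to round(minsplit/3) and minsplit to minbucket * 3.

```r
library(rpart)

# Supplying only minsplit: minbucket defaults to round(minsplit / 3)
ctrl <- rpart.control(minsplit = 131)
ctrl$minbucket  # round(131 / 3) = 44, matching the value used below

# Supplying only minbucket: minsplit defaults to 3 * minbucket
ctrl2 <- rpart.control(minbucket = 44)
ctrl2$minsplit  # 132
```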
bucketFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(minbucket = 44))
rpart.plot(bucketFit,extra = 1, main = "Tree to Predict Spam with Optimized Minimum Bucket Parameter")
The maxcompete parameter sets the number of competitor splits retained in the output. The function is modified to try compete maximums between 1 and 10.
#modification for max compete
compete = c(1:10)
comfits = lapply(compete, function(x) {
rpartObj = rpart(isSpam ~ ., data = trainDF,
method="class",
control = rpart.control(maxcompete=x) )
predict(rpartObj,
newdata = testDF[ , names(testDF) != "isSpam"],
type = "class")
})
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(comfits, function(preds) {
typeI = sum(preds[ !spam ] == "T") / numHam
typeII = sum(preds[ spam ] == "F") / numSpam
c(typeI = typeI, typeII = typeII)
})
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ compete, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.7), xlim = c(0,10),
ylab="Error", xlab="Maxcompete Parameter Values")
points(errs[2,] ~ compete, type="l", col=cols[1], lwd = 2)
text(x =c(2.2, 2.2), y = c(0.65, 0.15),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = compete[minI], col ="grey", lty =3, lwd=2)
text(2.3, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(2.3, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
title("Maxcompete Error Rates")
print(compete[minI])
#tree graph
competeFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(maxcompete = 5))
rpart.plot(competeFit,extra = 1, main = "Tree to Predict Spam with Optimized Maxcompete Parameter")
The maxdepth parameter sets the maximum depth of any node of the final tree, with the root node counted as depth 0. This is a useful parameter that can be systematically incremented to aid in understanding variable importance in a model. Here, the function is modified to try maximum tree depths between 1 and 30.
#modification for max depth
depth = c(1:30)
fits = lapply(depth, function(x) {
rpartObj = rpart(isSpam ~ ., data = trainDF,
method="class",
control = rpart.control(maxdepth=x) )
predict(rpartObj,
newdata = testDF[ , names(testDF) != "isSpam"],
type = "class")
})
spam = testDF$isSpam == "T"
numSpam = sum(spam)
numHam = sum(!spam)
errs = sapply(fits, function(preds) {
typeI = sum(preds[ !spam ] == "T") / numHam
typeII = sum(preds[ spam ] == "F") / numSpam
c(typeI = typeI, typeII = typeII)
})
#graph of errors
cols = brewer.pal(9, "Set1")[c(3, 4, 5)]
plot(errs[1,] ~ depth, type="l", col=cols[2],
lwd = 2, ylim = c(0,0.7), xlim = c(0,15),
ylab="Error", xlab="maxdepth parameter values")
points(errs[2,] ~ depth, type="l", col=cols[1], lwd = 2)
text(x =c(2.2, 2.2), y = c(0.65, 0.15),
labels=c("Type II Error", "Type I Error"))
minI = which(errs[1,] == min(errs[1,]))[1]
abline(v = depth[minI], col ="grey", lty =3, lwd=2)
text(2.3, errs[1, minI]+0.01,
formatC(errs[1, minI], digits = 2))
text(2.3, errs[2, minI]+0.01,
formatC(errs[2, minI], digits = 3))
title("Maxdepth Error Rates")
print(depth[minI])
#graph of tree
depthFit = rpart(isSpam ~ ., data = trainDF, method = "class", control = rpart.control(maxdepth = 3))
rpart.plot(depthFit,extra = 1, main = "Tree to Predict Spam with Optimized Maximum Depth Parameter")
Next, an rpart model will be run with the values that produced the lowest Type I error in the individual control parameter testing. maxcompete and minbucket can be removed because they do not impact the outcome.
#fit for optimized model
rpartFit1 = rpart(isSpam ~ ., data = trainDF, method = "class",
control = rpart.control(cp = 0.0015,
minsplit = 131,
#minbucket = 44, omitted: no impact on outcome
#maxcompete = 1, omitted: no impact on outcome
maxdepth = 3))
predictions = predict(rpartFit1,
newdata = testDF[, names(testDF) != "isSpam"],
type = "class")
#type I error
predsForHam = predictions[ testDF$isSpam == "F" ]
print("Type I Error for Optimized Model is:")
sum(predsForHam == "T") / length(predsForHam)
#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Optimized Model is:")
sum(predsForSpam == "F") / length(predsForSpam)
#tree for optimized model
rpart.plot(rpartFit1,extra = 1, main = "Tree to Predict Spam with Optimized Control Parameters")
In the final model, cp remained at 0.0015, but minsplit and maxdepth were altered slightly to improve the Type II error rate.
#modified optimized model
rpartFit2 = rpart(isSpam ~ ., data = trainDF, method = "class",
control = rpart.control(cp = 0.0015,
minsplit=125,#change from 131
maxdepth=10))#change from 3
predictions = predict(rpartFit2,
newdata = testDF[, names(testDF) != "isSpam"],
type = "class")
#type I error
predsForHam = predictions[ testDF$isSpam == "F" ]
print("Type I Error for Modified Optimized Model is:")
sum(predsForHam == "T") / length(predsForHam)
#type 2 error
predsForSpam = predictions[ testDF$isSpam == "T" ]
print("Type II Error for Modified Optimized Model is:")
sum(predsForSpam == "F") / length(predsForSpam)
#tree plot for modified optimized model
rpart.plot(rpartFit2,extra = 1, main = "Tree to Predict Spam with Modified Optimized Control Parameters")
This study examined the role that control parameters play in building decision trees with the rpart package. Using a dataset of email messages classified as spam or ham, multiple values for the individual rpart.control() parameters were assessed by their error rates. A new optimized model improved the Type I error over the baseline model with default control parameters, but worsened the Type II error rate. Further tuning of the control parameters produced a model with a more acceptable compromise between Type I and Type II error rates.
| Model | Type I | Type II |
|---|---|---|
| Baseline | 0.065 | 0.186 |
| Optimized | 0.023 | 0.481 |
| Modified Optimized | 0.040 | 0.249 |
Further study could be done into the role that some of the other control parameters play, including maxsurrogate, usesurrogate, xval, and surrogatestyle. Additionally, study of the parms and cost parameters could be useful.
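As one possible direction, the parms argument accepts a loss matrix that penalizes the two error types asymmetrically, which fits this study's goal of keeping the Type I error low. A hedged sketch on the kyphosis data shipped with rpart (not the email data), with illustrative cost values:

```r
library(rpart)

# Loss matrix: rows = true class, columns = predicted class, in factor
# level order. Off-diagonal entries are misclassification costs; here
# misclassifying the first class costs 5 times as much as the second.
lossMat <- matrix(c(0, 5,
                    1, 0), nrow = 2, byrow = TRUE)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", parms = list(loss = lossMat))

# The fitted tree now favors splits that avoid the expensive error type
print(fit)
```

In the spam setting, the analogous move would be to make "ham predicted as spam" the expensive cell, trading some Type II error for a lower Type I error.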