In data science, part of any project is processing and readying the data for analysis. This data cleaning process can be a challenge, especially scraping data from the internet. Data scientist can encounter a wide range in webpage structure, data quality and formatting. This case study will explore webscraping with R and the challenges that can arise from inconsistent data formatting.
To demostrate method for webscraping, data from an annual 10 mile race will be used. The Cherry Blossom Ten Mile Run and the festival that occurs every spring in the nation's capital commemorates the 1912 gift of 3,000 cherry trees from the city of Tokyo, Japan to Washington, DC. Thousands of people have run the ten miler since it's start in 1973. This study will scrape the results data for the 1999 to 2012 races from the run's website. In their book, Nolan and Lang demonstrate webscraping with race results for the men, giving a good foundation to begin this paper's tasks. Specifically, this paper will answer question 7:
After preparing the data, the relationship between age and run time for female runners will be explored using a simlpe linear model, loess and piecewise linear model.
The objective for this is study is to scrape the race data from the web and clean it for the female racers, just as Nolan and Lang demonstrated with the mens data. The challenge is that for the different genders and race years the data was recorded slightly differently and there are some mistakes in the data. Some issues are simple differences in format of the table headers and the use of footnotes. The types of mistakes include values that begin in the wrong column, missing headers, spacing problems and so on. These inconsistencies must be attended to before meaningful analysis can begin.
| Variable | Description |
|---|---|
| Place | Numeric value indicating what overall place the runner finished the race |
| Div/Tot | Numeric value indicating what place the runner finished the race within their division |
| Num | The runner's bib number |
| Name | The runner's first and last name |
| Ag | The runner's age |
| Hometown | City or country where the runner hails from |
| Net | Overall run time; Personal run time based on when the runner crosses the sensor at the start to when the runner crosses the sendor at the finish; Also marked at simply time in some years |
| Gun | The time from when the gun goes off to the time the runner crosses the finish line |
| Pace | Average minutes run per mile |
| Split | Runner's time at the 5 mile mark |
womenURLs. The base plus the 14 results URL extensions are combined and then assigned to urlsWomen.¶library(XML)
#main page for cherry blossom race
ubase = "http://www.cherryblossom.org/"
# pages for each of the female race results.
womenURLs = c(
"results/1999/cb99f.html",
"results/2000/Cb003f.htm",
"results/2001/oof_f.html",
"results/2002/ooff.htm",
"results/2003/CB03-F.HTM",
"results/2004/women.htm",
"results/2005/CB05-F.htm",
"results/2006/women.htm",
"results/2007/women.htm",
"results/2008/women.htm",
"results/2009/09cucb-F.htm",
"results/2010/2010cucb10m-f.htm",
"results/2011/2011cucb10m-f.htm",
"results/2012/2012cucb10m-f.htm"
)
#URLs for womens races
urlsWomen = paste(ubase, womenURLs, sep="")
#check urls
urlsWomen[1:14]
extractResTable() function is created to pull the results data and write it in to a text file. Within extractResTable XML’s htmlParse() function is used to scrape the women’s results for each year from the site. Because the data format changes slightly for year to year, conditional statements are included in to take in to account the variations. The function writeLines() is used to write the character vectors out as plain text files.¶extractResTable =
# takes a list of websites from the cherry blossom race
# a list of years corresponding to the year the result is for
# and the gender of the participant
# Retrieve data from web site,
# find the preformatted text,
# and write lines or return as a character vector.
# returns a list of strings corrsponding to lines in the web url
function(url = "http://www.cherryblossom.org/results/2009/09cucb-F.htm",
year = 1999, sex = "female", file = NULL)
{
#encoding as UTF-8 fixes A symbol issue
doc = htmlParse(url, encoding="UTF-8")
if (year == 2000) {
# Get preformatted text from 4th font element
# The top file is ill formed so the <pre> search doesn't work.
ff = getNodeSet(doc, "//font")
txt = xmlValue(ff[[4]])
els = strsplit(txt, "\r\n")[[1]]
}
else if (year == 2009 & sex == "male") {
# Get preformatted text from <div class="Section1"> element
# Each line of results is in a <pre> element
div1 = getNodeSet(doc, "//div[@class='Section1']")
pres = getNodeSet(div1[[1]], "//pre")
els = sapply(pres, xmlValue)
}
else if (year == 1999 & sex == "female") {
# Get preformatted text from <pre> elements
pres = getNodeSet(doc, "//pre")
txt = xmlValue(pres[[1]])
els = strsplit(txt, "\n")[[1]]
}
else {
# Get preformatted text from <pre> elements
pres = getNodeSet(doc, "//pre")
txt = xmlValue(pres[[1]])
els = strsplit(txt, "\r\n")[[1]]
}
if (is.null(file)) return(els)
# Write the lines as a text file.
writeLines(els, con = file)
}
extracResTable() function is applied to each of the women’s URLs with mapply(). The count of rows for each year is also output. Except for 2000, there is a steady increase in the number of female runners each year.¶years = 1999:2012
womenTables = mapply(extractResTable, url = urlsWomen, year = years)
names(womenTables) = years
sapply(womenTables, length)
save(womenTables, file = "WomenTextTables.rda")
write() function writes each year.¶# For storage of text files
dir.create(file.path(getwd(), "WomenTxt"))
# write each year to its own folder
write(x=womenTables$'2012',file="WomenTxt/2012.txt")
write(x=womenTables$'2011',file="WomenTxt/2011.txt")
write(x=womenTables$'2010',file="WomenTxt/2010.txt")
write(x=womenTables$'2009',file="WomenTxt/2009.txt")
write(x=womenTables$'2008',file="WomenTxt/2008.txt")
write(x=womenTables$'2007',file="WomenTxt/2007.txt")
write(x=womenTables$'2006',file="WomenTxt/2006.txt")
write(x=womenTables$'2005',file="WomenTxt/2005.txt")
write(x=womenTables$'2004',file="WomenTxt/2004.txt")
write(x=womenTables$'2003',file="WomenTxt/2003.txt")
write(x=womenTables$'2002',file="WomenTxt/2002.txt")
write(x=womenTables$'2001',file="WomenTxt/2001.txt")
write(x=womenTables$'2000',file="WomenTxt/2000.txt")
write(x=womenTables$'1999',file="WomenTxt/1999.txt")
#No header is present for the year 2001
womenTables$'2001'[2:3]
#2002 has a correct header that mataches format for 2001
womenTables$'2002'[2:3]
#appending 2002 header to 2001
womenTables$'2001'[2:3] = womenTables$'2002'[2:3]
#check 2001
womenTables$'2001'[2:3]
women2010 = readLines("WomenTxt/2010.txt")
women2010[1:12]
findColLocs() function will help to locate the beginning and end of each column so that the variables are identified. It does this by identifying where the blanks are. The selectCols() function pulls out the names of the columns for analysis, the header row that contains the column names, and the locations of the blanks in the separator row.¶findColLocs = function(spacerRow) {
spaceLocs = gregexpr(" ", spacerRow)[[1]]
rowLength = nchar(spacerRow)
if (substring(spacerRow, rowLength, rowLength) != " ")
return( c(0, spaceLocs, rowLength + 1))
else return(c(0, spaceLocs))
}
# Create function to extract specific cols from the data tables
selectCols =
function(colNames, headerRow, searchLocs)
{
sapply(colNames,
function(name, headerRow, searchLocs)
{
startPos = regexpr(name, headerRow)[[1]]
if (startPos == -1)
return( c(NA, NA) )
index = sum(startPos >= searchLocs)
c(searchLocs[index] + 1, searchLocs[index + 1] - 1)
},
headerRow = headerRow, searchLocs = searchLocs )
}
grep(), each .txt file can be searched through to locate the row with above the data. Once the row is located, it can be extracted with the row above that contains the column names. The rest can be discarded.¶#extract variables - identify desired columns
extractVariables =
function(file, varNames =c("name", "home", "ag", "gun",
"net", "time"))
{
# Find the index of the row with =s
eqIndex = grep("^===", file)
# Extract the two key rows and the data
spacerRow = file[eqIndex]
headerRow = tolower(file[ eqIndex - 1 ])
body = file[ -(1 : eqIndex) ]
# Obtain the starting and ending positions of variables
searchLocs = findColLocs(spacerRow)
locCols = selectCols(varNames, headerRow, searchLocs)
Values = mapply(substr, list(body), start = locCols[1, ],
stop = locCols[2, ])
colnames(Values) = varNames
invisible(Values)
}
extractVariable(). Then a data frame for each race year can be created by applying extractVariable() to womensFiles. The row counts are also displayed. There are fewer rows displayed now due to the removing of some header rows.¶wfilenames = paste("WomenTxt/", 1999:2012, ".txt", sep = "")
womenFiles = lapply(wfilenames, readLines)
names(womenFiles) = 1999:2012
womenResMat = lapply(womenFiles, extractVariables)
sapply(womenResMat, nrow)
#boxplot of female ages for each race year
# Convert to numeric
age = sapply(womenResMat,
function(x) as.numeric(x[ , 'ag']))
# View ages with boxplot
oldPar = par(mar = c(4.1, 4.1, 1, 1))
boxplot(age, ylab = "Age", xlab = "Year", main="Boxplot of Female Runner Age by Year")
par(oldPar)
dev.off()
selectCols() function that locates the end of a column to include the blank position.¶head(womenFiles[['2003']])
# Need to modify function to account for spacing issue
selectCols = function(shortColNames, headerRow, searchLocs) {
sapply(shortColNames, function(shortName, headerRow, searchLocs){
startPos = regexpr(shortName, headerRow)[[1]]
if (startPos == -1) return( c(NA, NA) )
index = sum(startPos >= searchLocs)
c(searchLocs[index] + 1, searchLocs[index + 1])
}, headerRow = headerRow, searchLocs = searchLocs )
}
# Re-run boxplots using the modified selectCols function from above.
womenResMat = lapply(womenFiles, extractVariables)
age = sapply(womenResMat,
function(x) as.numeric(x[ , 'ag']))
# Boxplot
oldPar = par(mar = c(4.1, 4.1, 1, 1))
boxplot(age, ylab = "Age", xlab = "Year", main="Boxplot of Female Runner Age by Year")
par(oldPar)
dev.off()
Which() is used to find the runner with age equal to zero. She is found to be runner in finishing place 2611. To locate her row, three is added to the place to account for the header.¶#locating record with 0 entry for age
age2001 = age[["2001"]]
which(age2001 == 0)
#Adding 3 places to the row to account for the header
womenFiles[['2001']][2614]
wfilenames = paste("WomenTxt/", 1999:2012, ".txt", sep = "")
womenFiles = lapply(wfilenames, readLines)
names(womenFiles) = 1999:2012
womenResMat = lapply(womenFiles, extractVariables)
sapply(womenResMat, nrow)
womenResMat = lapply(womenFiles, extractVariables)
age = sapply(womenResMat,
function(x) as.numeric(x[ , 'ag']))
womenFiles[['2001']][2614]
# Boxplot
oldPar = par(mar = c(4.1, 4.1, 1, 1))
boxplot(age, ylab = "Age", xlab = "Year", main="Boxplot of Female Runner Age by Year")
par(oldPar)
dev.off()
createTime, splitting the time into its individual components and handling run times that are under one hour. The : characters are discarded and the variable elements are converted to numeric. Then the time elements are combined in to a single value that reports the time in minutes.¶createDF, that determines which time to use (net, gun, or time), handles the footnote symbols (# and * ) that are in some time records, drops records of runners who did not finish the race, and exports the results to a data frame.¶# split hh:mm:ss into time segments
# Convert time to numeric
convertTime = function(time) {
timePieces = strsplit(time, ":")
timePieces = sapply(timePieces, as.numeric)
sapply(timePieces, function(x) {
if (length(x) == 2) x[1] + x[2]/60
else 60*x[1] + x[2] + x[3]/60
})
}
createDF = function(Res, year, sex)
{
# Determine which time to use - preference is for net time
if ( !is.na(Res[1, 'net']) ) useTime = Res[ , 'net']
else if ( !is.na(Res[1, 'gun']) ) useTime = Res[ , 'gun']
else useTime = Res[ , 'time']
# Removing #, * and blanks from time
useTime = gsub("[#\\*[:blank:]]", "", useTime)
runTime = convertTime(useTime[ useTime != "" ])
# Drop the records of those that did not finish race
Res = Res[ useTime != "", ]
Results = data.frame(year = rep(year, nrow(Res)),
sex = rep(sex, nrow(Res)),
name = Res[ , 'name'], home = Res[ , 'home'],
age = as.numeric(Res[, 'ag']),
runTime = runTime,
stringsAsFactors = FALSE)
invisible(Results)
}
cbWomen using do.call() function to call rbind() with the list of data frames as input.¶#combine and save all womens race results
cbWomen = do.call(rbind, womenDF)
save(cbWomen, file = "cbWomen.rda")
#check the cbWomen dimensions
dim(cbWomen)
#scatter plot of age and run time for women.
plot(runTime ~ age, data = cbWomen,
ylim = c(40, 180),
xlab = "Age (years)",
ylab = "Run Time (minutes)", main= "Scatter Plot of Run Time vs. Age for Female Runners")
smoothscatter() plot is displayed (Fig. 5).¶#load library
library(RColorBrewer)
ls("package:RColorBrewer")
#smooth scatter plot for run time v age
smoothScatter(y = cbWomen$runTime, x = cbWomen$age,
ylim = c(40, 165), xlim = c(15, 85),
xlab = "Age (years)",
ylab = "Run Time (minutes)",
main= "Scatter Plot of Run Time vs. Age for Female Runners")
#Load library
library(ggplot2)
#Plot count of female runners by age
ageCat = cut(cbWomenSub$age, breaks = c(seq(15, 75, 10), 90))
agebins=as.data.frame(table(ageCat))
ggplot(agebins, aes(x=ageCat, y=Freq)) + geom_bar(stat="identity") + ggtitle("Count of Female Runner Age Categories") + xlab('Age')
#Boxplot of run times by age
plot(cbWomenSub$runTime ~ ageCat, xlab = "Age (Years)", ylab = "Run Time (Minutes)")
title(main = "Female Runner Run Time By Age")
#simple linear regression
lmAge = lm(runTime ~ age, data = cbWomenSub)
lmAge
summary(lmAge)
loess() is used to find a pattern in the residuals by taking in to account local weighted averages of age. LOESS can be useful because it employs a parametric curve to visualize trends in data that might be difficult to model linearly.¶predict() on the loess fit.¶#loess on residuals of lmAge linear model
resid.lo = loess(resids ~ age, data = data.frame(resids = residuals(lmAge), age = cbWomenSub$age))
age20to80 = 20:80
resid.lo.pr = predict(resid.lo, newdata = data.frame(age = age20to80))
#plot of loess curve
smoothScatter(x = cbWomenSub$age, y = lmAge$residuals,
xlab = "Age (years)",
ylab = "Residuals",
main="Residual Plot from a Simple Linear Model of Performance to Age")
abline(h = 0, col = "purple", lwd = 3)
lines(x = age20to80, y = resid.lo.pr, col = "green", lwd = 3, lty = 2)
#loess model
womenRes.lo = loess(runTime ~ age, cbWomenSub)
womenRes.lo.pr = predict(womenRes.lo, data.frame(age = age20to80))
#piecewise linear model
#creating variables for age bins
decades = seq(30, 60, by = 10)
overAge = lapply(decades,
function(x) pmax(0, (cbWomenSub$age - x)))
names(overAge) = paste("over", decades, sep = "")
overAge = as.data.frame(overAge)
#least squares fit
lmPiecewise = lm(runTime ~ . ,
data = cbind(cbWomenSub[, c("runTime", "age")],
overAge))
summary(lmPiecewise)
#create a data frame of the covariates to plot
overAge20 = lapply(decades, function(x) pmax(0, (age20to80 - x)))
names(overAge20) = paste("over", decades, sep = "")
overAgeDF = cbind(age = data.frame(age = age20to80), overAge20)
predPiecewise = predict(lmPiecewise, overAgeDF)
#plot piecewise segments
plot(predPiecewise ~ age20to80,
type = "l", col = "purple", lwd = 3,
xlab = "Age (years)", ylab = "Run Time Prediction", main = "Piecewise Linear & Loess Curves for Run Time vs. Age for Women")
#plot loess curve
lines(x = age20to80, y = womenRes.lo.pr,
col = "green", lty = 2, lwd = 3)
legend("topleft", col = c("purple", "green"),
lty = c(1, 2), lwd= 3,
legend = c("Piecewise Linear", "Loess Curve"), bty = "n")
dev.off()