Quantifying the World¶

Case Study 4 - Cherry Blossom Race¶

Question 7¶

Stacy Conant¶

Feb 10, 2020¶

¶

Contents¶

Introduction
Data
Method
Conclusions
References

Back to Top

Introduction¶

In data science, part of any project is processing and readying the data for analysis. This data cleaning process can be a challenge, especially scraping data from the internet. Data scientist can encounter a wide range in webpage structure, data quality and formatting. This case study will explore webscraping with R and the challenges that can arise from inconsistent data formatting.

To demostrate method for webscraping, data from an annual 10 mile race will be used. The Cherry Blossom Ten Mile Run and the festival that occurs every spring in the nation's capital commemorates the 1912 gift of 3,000 cherry trees from the city of Tokyo, Japan to Washington, DC. Thousands of people have run the ten miler since it's start in 1973. This study will scrape the results data for the 1999 to 2012 races from the run's website. In their book, Nolan and Lang demonstrate webscraping with race results for the men, giving a good foundation to begin this paper's tasks. Specifically, this paper will answer question 7:

Follow the approach developed in Nolan and Lang's Data Science in R to read the files for the female runners
Process the female data create a data frame for analysis.
Generalize the createDF() and extractVariables() functions to handle additional oddities in the raw text files.

After preparing the data, the relationship between age and run time for female runners will be explored using a simlpe linear model, loess and piecewise linear model.

Back to Top

Data¶

The objective for this is study is to scrape the race data from the web and clean it for the female racers, just as Nolan and Lang demonstrated with the mens data. The challenge is that for the different genders and race years the data was recorded slightly differently and there are some mistakes in the data. Some issues are simple differences in format of the table headers and the use of footnotes. The types of mistakes include values that begin in the wrong column, missing headers, spacing problems and so on. These inconsistencies must be attended to before meaningful analysis can begin.

Variable	Description
Place	Numeric value indicating what overall place the runner finished the race
Div/Tot	Numeric value indicating what place the runner finished the race within their division
Num	The runner's bib number
Name	The runner's first and last name
Ag	The runner's age
Hometown	City or country where the runner hails from
Net	Overall run time; Personal run time based on when the runner crosses the sensor at the start to when the runner crosses the sendor at the finish; Also marked at simply `time` in some years
Gun	The time from when the gun goes off to the time the runner crosses the finish line
Pace	Average minutes run per mile
Split	Runner's time at the 5 mile mark

Back to Top

Method¶

Back to Top

Webscraping¶

The first step is to find the website and retrieve the target data. The Cherry Blossom Race website is easy to navigate and locate the race results pages. The race’s main page is used as the base for extracting each year’s race results. The URL extenstions for the individual years are manually entered, as they all have a slightly different format, and they are saveed at a character vector to `womenURLs`. The base plus the 14 results URL extensions are combined and then assigned to `urlsWomen`.¶

library(XML)

#main page for cherry blossom race
ubase = "http://www.cherryblossom.org/"

# pages for each of the female race results.
womenURLs = c(
    "results/1999/cb99f.html",
    "results/2000/Cb003f.htm",
    "results/2001/oof_f.html",
    "results/2002/ooff.htm", 
    "results/2003/CB03-F.HTM",
    "results/2004/women.htm", 
    "results/2005/CB05-F.htm", 
    "results/2006/women.htm", 
    "results/2007/women.htm", 
    "results/2008/women.htm", 
    "results/2009/09cucb-F.htm",
    "results/2010/2010cucb10m-f.htm", 
    "results/2011/2011cucb10m-f.htm",
    "results/2012/2012cucb10m-f.htm"
)

#URLs for womens races
urlsWomen = paste(ubase, womenURLs, sep="")
#check urls
urlsWomen[1:14]

The `extractResTable()` function is created to pull the results data and write it in to a text file. Within `extractResTable` XML’s htmlParse() function is used to scrape the women’s results for each year from the site. Because the data format changes slightly for year to year, conditional statements are included in to take in to account the variations. The function `writeLines()` is used to write the character vectors out as plain text files.¶

extractResTable =
  # takes a list of websites from the cherry blossom race
  # a list of years corresponding to the year the result is for
  # and the gender of the participant
  # Retrieve data from web site, 
  # find the preformatted text,
  # and write lines or return as a character vector.
  # returns a list of strings corrsponding to lines in the web url
  
  function(url = "http://www.cherryblossom.org/results/2009/09cucb-F.htm",
           year = 1999, sex = "female", file = NULL)
  {    
    #encoding as UTF-8 fixes A symbol issue
    doc = htmlParse(url, encoding="UTF-8")

    if (year == 2000) {
      # Get preformatted text from 4th font element
      # The top file is ill formed so the <pre> search doesn't work.
      ff = getNodeSet(doc, "//font")
      txt = xmlValue(ff[[4]])
      els = strsplit(txt, "\r\n")[[1]]
    }
    else if (year == 2009 & sex == "male") {
      # Get preformatted text from <div class="Section1"> element
      # Each line of results is in a <pre> element
      div1 = getNodeSet(doc, "//div[@class='Section1']")
      pres = getNodeSet(div1[[1]], "//pre")
      els = sapply(pres, xmlValue)
    }
     else if (year == 1999 & sex == "female") {
     # Get preformatted text from <pre> elements
      pres = getNodeSet(doc, "//pre")
      txt = xmlValue(pres[[1]])
      els = strsplit(txt, "\n")[[1]]  
     } 
    else {
      # Get preformatted text from <pre> elements
      pres = getNodeSet(doc, "//pre")
      txt = xmlValue(pres[[1]])
      els = strsplit(txt, "\r\n")[[1]]   
    } 
    
    if (is.null(file)) return(els)
    # Write the lines as a text file.
    writeLines(els, con = file)
  }

Next, the `extracResTable()` function is applied to each of the women’s URLs with `mapply()`. The count of rows for each year is also output. Except for 2000, there is a steady increase in the number of female runners each year.¶

years = 1999:2012
womenTables = mapply(extractResTable, url = urlsWomen, year = years)
names(womenTables) = years
sapply(womenTables, length)
save(womenTables, file = "WomenTextTables.rda")

Next, a file directory is created to house a text file for each year of women's race results. The `write()` function writes each year.¶

# For storage of text files
dir.create(file.path(getwd(), "WomenTxt"))

# write each year to its own folder
write(x=womenTables$'2012',file="WomenTxt/2012.txt")
write(x=womenTables$'2011',file="WomenTxt/2011.txt")
write(x=womenTables$'2010',file="WomenTxt/2010.txt")
write(x=womenTables$'2009',file="WomenTxt/2009.txt")
write(x=womenTables$'2008',file="WomenTxt/2008.txt")
write(x=womenTables$'2007',file="WomenTxt/2007.txt")
write(x=womenTables$'2006',file="WomenTxt/2006.txt")
write(x=womenTables$'2005',file="WomenTxt/2005.txt")
write(x=womenTables$'2004',file="WomenTxt/2004.txt")
write(x=womenTables$'2003',file="WomenTxt/2003.txt")
write(x=womenTables$'2002',file="WomenTxt/2002.txt")
write(x=womenTables$'2001',file="WomenTxt/2001.txt")
write(x=womenTables$'2000',file="WomenTxt/2000.txt")
write(x=womenTables$'1999',file="WomenTxt/1999.txt")

Warning message in dir.create(file.path(getwd(), "WomenTxt")):
"'C:\Users\JPH\WomenTxt' already exists"

Visual inspection of the text files shows that 2001 does not have a header for the columns. The format of the 2002 file is correct and contains the same columns, so its header can be applied to 2001.¶

#No header is present for the year 2001
womenTables$'2001'[2:3]

#2002 has a correct header that mataches format for 2001
womenTables$'2002'[2:3]

#appending 2002 header to 2001
womenTables$'2001'[2:3] = womenTables$'2002'[2:3]
#check 2001
womenTables$'2001'[2:3]

Next, it is a good idea to take a look at the first few rows of data to get better understanding of the formatting. Below, is the example from 2010, but each year has been reviewed. Some years include different columns for race times (net time vs time), or the addition of gun time and splits.¶

women2010 = readLines("WomenTxt/2010.txt")
women2010[1:12]

The `findColLocs()` function will help to locate the beginning and end of each column so that the variables are identified. It does this by identifying where the blanks are. The `selectCols()` function pulls out the names of the columns for analysis, the header row that contains the column names, and the locations of the blanks in the separator row.¶

findColLocs = function(spacerRow) {

  spaceLocs = gregexpr(" ", spacerRow)[[1]]
  rowLength = nchar(spacerRow)

  if (substring(spacerRow, rowLength, rowLength) != " ")
    return( c(0, spaceLocs, rowLength + 1))
  else return(c(0, spaceLocs))
}

# Create function to extract specific cols from the data tables
selectCols = 
function(colNames, headerRow, searchLocs) 
{
  sapply(colNames, 
         function(name, headerRow, searchLocs)
         {
           startPos = regexpr(name, headerRow)[[1]]
           if (startPos == -1) 
             return( c(NA, NA) )
    
           index = sum(startPos >= searchLocs)
           c(searchLocs[index] + 1, searchLocs[index + 1] - 1)
         },
         headerRow = headerRow, searchLocs = searchLocs )
}

Next, the extractVariables() function is used to select the variables of interest, which may vary from year to year, as seen from earlier visual inspection.¶

Since each .txt file is slightly different, a technique must be found to locate the data. The row at the bottom of the header contains repeating equal signs "=". The row under these equal signs is where the data begins. Using `grep()`, each .txt file can be searched through to locate the row with above the data. Once the row is located, it can be extracted with the row above that contains the column names. The rest can be discarded.¶

#extract variables - identify desired columns
extractVariables = 
  function(file, varNames =c("name", "home", "ag", "gun",
                             "net", "time"))
{
       # Find the index of the row with =s
  eqIndex = grep("^===", file)
       # Extract the two key rows and the data
  spacerRow = file[eqIndex] 
  headerRow = tolower(file[ eqIndex - 1 ])
  body = file[ -(1 : eqIndex) ]
       
       # Obtain the starting and ending positions of variables
  searchLocs = findColLocs(spacerRow)
  locCols = selectCols(varNames, headerRow, searchLocs)

  Values = mapply(substr, list(body), start = locCols[1, ], 
                  stop = locCols[2, ])
  colnames(Values) = varNames
  
  invisible(Values)
}

Next the womens data is made into character vectors so that it can be passed to `extractVariable()`. Then a data frame for each race year can be created by applying `extractVariable()` to `womensFiles`. The row counts are also displayed. There are fewer rows displayed now due to the removing of some header rows.¶

wfilenames = paste("WomenTxt/", 1999:2012, ".txt", sep = "")
womenFiles = lapply(wfilenames, readLines)
names(womenFiles) = 1999:2012

womenResMat = lapply(womenFiles, extractVariables)
sapply(womenResMat, nrow)

Back to Top

Cleaning Age¶

Now that the data is extracted, one can begin looking at variables of interest. To assess runner age, it must first be converted from a character data type to numeric. Then it can be displayed in a boxplot by year (Fig. 1).¶

#boxplot of female ages for each race year

# Convert to numeric
age = sapply(womenResMat,
             function(x) as.numeric(x[ , 'ag']))

# View ages with boxplot
oldPar = par(mar = c(4.1, 4.1, 1, 1))
boxplot(age, ylab = "Age", xlab = "Year", main="Boxplot of Female Runner Age by Year")
par(oldPar)
dev.off()

Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"

Figure 1: Boxplot of Female Runners by Year. 2003 ages are skewed to very young ages. This must be investigated and corrected. </h3>

It is obvious that there an issue with the ages in 2003. The ages are skewed very young. Upon visual inspection, it is clear that the age values are shifted to the right one space in comparison to the location of the ‘=’ characters, so it is only reading one digit of the two digit age.¶

To fix this issue, modifications are made to the `selectCols()` function that locates the end of a column to include the blank position.¶

head(womenFiles[['2003']])

# Need to modify function to account for spacing issue
selectCols = function(shortColNames, headerRow, searchLocs) {
  sapply(shortColNames, function(shortName, headerRow, searchLocs){
    startPos = regexpr(shortName, headerRow)[[1]]
    if (startPos == -1) return( c(NA, NA) )
    index = sum(startPos >= searchLocs)
    c(searchLocs[index] + 1, searchLocs[index + 1])
  }, headerRow = headerRow, searchLocs = searchLocs )
}

The boxplot of female runner age is then re-run to check that the modifications applied have been successful (Fig.2).¶

# Re-run boxplots using the modified selectCols function from above.
womenResMat = lapply(womenFiles, extractVariables)

age = sapply(womenResMat, 
             function(x) as.numeric(x[ , 'ag']))

# Boxplot
oldPar = par(mar = c(4.1, 4.1, 1, 1))
boxplot(age, ylab = "Age", xlab = "Year",  main="Boxplot of Female Runner Age by Year")
par(oldPar)
dev.off()

Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"

Figure 2: Boxplot of Female Runners by Year. 2003 ages have been corrected. </h3>

But it seems that in 2001, there was one runner whose age was reported as zero. `Which()` is used to find the runner with age equal to zero. She is found to be runner in finishing place 2611. To locate her row, three is added to the place to account for the header.¶

To rectify this error, this runner could be removed from the data file or an age could be imputed. Upon futher investigation of race results at http://www.cballtimeresults.org/runners/loretta-cuce, it is found that the runner, Loretta Cuce, has run the race before and reported her age; her correct age can then be inferred. The appropriate age for Loretta Cuce in 2001 would be 55.¶

#locating record with 0 entry for age
age2001 = age[["2001"]]
which(age2001 == 0)
#Adding 3 places to the row to account for the header
womenFiles[['2001']][2614]

The age for Loretta Cuce is manually imputed in to the 2001.txt file and code is re-run.¶

wfilenames = paste("WomenTxt/", 1999:2012, ".txt", sep = "")
womenFiles = lapply(wfilenames, readLines)
names(womenFiles) = 1999:2012

womenResMat = lapply(womenFiles, extractVariables)
sapply(womenResMat, nrow)

womenResMat = lapply(womenFiles, extractVariables)

age = sapply(womenResMat, 
             function(x) as.numeric(x[ , 'ag']))
             
             
womenFiles[['2001']][2614]

Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"Warning message in FUN(X[[i]], ...):
"NAs introduced by coercion"

Loretta Cuce's age is now correctly reported as 55. Re-running boxplot code to check that the 0 age has been corrected (Fig. 3).¶

# Boxplot
oldPar = par(mar = c(4.1, 4.1, 1, 1))
boxplot(age, ylab = "Age", xlab = "Year",  main="Boxplot of Female Runner Age by Year")
par(oldPar)
dev.off()

Figure 3: Boxplot of Female Runners by Year. 2001 missing age has been corrected. </h3>

Back to Top

Cleaning Run Time¶

The next task is to convert the time variable. It is formatted in hours, minutes and seconds, but will be converted to minutes. This is done done by creating a function, `createTime`, splitting the time into its individual components and handling run times that are under one hour. The `:` characters are discarded and the variable elements are converted to numeric. Then the time elements are combined in to a single value that reports the time in minutes.¶

Then we create a function, `createDF`, that determines which time to use (net, gun, or time), handles the footnote symbols (# and * ) that are in some time records, drops records of runners who did not finish the race, and exports the results to a data frame.¶

# split hh:mm:ss into time segments
# Convert time to numeric
convertTime = function(time) {
  timePieces = strsplit(time, ":") 
  timePieces = sapply(timePieces, as.numeric) 
  sapply(timePieces, function(x) {
                      if (length(x) == 2) x[1] + x[2]/60
                      else 60*x[1] + x[2] + x[3]/60
                      }) 
}

createDF = function(Res, year, sex) 
{
  # Determine which time to use - preference is for net time
  if ( !is.na(Res[1, 'net']) ) useTime = Res[ , 'net']
  else if ( !is.na(Res[1, 'gun']) ) useTime = Res[ , 'gun']
  else useTime = Res[ , 'time']
  
        # Removing #, * and blanks from time
  useTime = gsub("[#\\*[:blank:]]", "", useTime)
  runTime = convertTime(useTime[ useTime != "" ])
  
        # Drop the records of those that did not finish race
  Res = Res[ useTime != "", ]
  
  Results = data.frame(year = rep(year, nrow(Res)),
                       sex = rep(sex, nrow(Res)),
                       name = Res[ , 'name'], home = Res[ , 'home'],
                       age = as.numeric(Res[, 'ag']), 
                       runTime = runTime,
                       stringsAsFactors = FALSE)
  invisible(Results)
}

After that data has been examined, cleaned and formatted, all years can be saved to `cbWomen` using `do.call()` function to call `rbind()` with the list of data frames as input.¶

#combine and save all womens race results
cbWomen = do.call(rbind, womenDF)  
save(cbWomen, file = "cbWomen.rda")

#check the cbWomen dimensions
dim(cbWomen)

Back to Top

Analysis¶

Now it is possible to study the relationship between age and run time for women racers between 1999 and 2012 by running a scatter plot for age vs run time (Fig. 4).¶

#scatter plot of age and run time for women.

plot(runTime ~ age, data = cbWomen, 
     ylim = c(40, 180),  
     xlab = "Age (years)", 
     ylab = "Run Time (minutes)", main= "Scatter Plot of Run Time vs. Age for Female Runners")

Figure 4: Scatterplot of Run time vs. Age for Female Runners - all years. </h3>

To better see and understand the relationship between age and run time, the color brewer library is loaded and a `smoothscatter()` plot is displayed (Fig. 5).¶

#load library
library(RColorBrewer)  
ls("package:RColorBrewer")

#smooth scatter plot for run time v age
smoothScatter(y = cbWomen$runTime, x = cbWomen$age,  
              ylim = c(40, 165), xlim = c(15, 85),  
              xlab = "Age (years)", 
              ylab = "Run Time (minutes)",
              main= "Scatter Plot of Run Time vs. Age for Female Runners")

Figure 5: Smooth Scatterplot of Run time vs. Age for Female Runners - all years. </h3>

There is a age concentration of runners 20 - 30 years old and with run times concentrated between 80 to 120 minutes.¶

Further insight can be gleaned from a count of runners by age (fig. 6) and a breakout of run times by age category (fig. 7).¶

#Load library
library(ggplot2)
#Plot count of female runners by age
ageCat = cut(cbWomenSub$age, breaks = c(seq(15, 75, 10), 90))
agebins=as.data.frame(table(ageCat))

ggplot(agebins, aes(x=ageCat, y=Freq)) + geom_bar(stat="identity") + ggtitle("Count of Female Runner Age Categories") + xlab('Age')

Figure 6: Count of Female Runner Age by 10 year category </h3>

The majority of female runners are between 25 and 35 years old.¶

#Boxplot of run times by age
plot(cbWomenSub$runTime ~ ageCat, xlab = "Age (Years)", ylab = "Run Time (Minutes)")
title(main = "Female Runner Run Time By Age")

Figure 7: Boxplot of Female Run Times by Age Category </h3>

It appears that the fastest runner were between 15 and 25 years old.¶

Next, a linear model can be run to attempt to capture the relationship between female age and run time. The `lm()` function is used to fit run time as a function of age.¶

#simple linear regression
lmAge = lm(runTime ~ age, data = cbWomenSub) 

lmAge

Call:
lm(formula = runTime ~ age, data = cbWomenSub)

Coefficients:
(Intercept)          age  
    91.3290       0.1994

summary(lmAge)

Call:
lm(formula = runTime ~ age, data = cbWomenSub)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.093  -9.445  -0.648   8.639  79.208 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 91.329004   0.195627  466.85   <2e-16 ***
age          0.199433   0.005572   35.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.1 on 75821 degrees of freedom
Multiple R-squared:  0.01661,	Adjusted R-squared:  0.0166 
F-statistic:  1281 on 1 and 75821 DF,  p-value: < 2.2e-16

The result finds a significant relationship between run time and age and for every year increase in age, a female runner run time increase by 0.1994 minutes. But this does not tell us where the change points are in the data; at what ages does slow down in run time vary. Is it strictly linear or is it non-linear.¶

To find investigate the relationship, `loess()` is used to find a pattern in the residuals by taking in to account local weighted averages of age. LOESS can be useful because it employs a parametric curve to visualize trends in data that might be difficult to model linearly.¶

A vector of the age values from 20 to 80 years is then created and predicted values are obtained from running `predict()` on the loess fit.¶

#loess on residuals of lmAge linear model
resid.lo = loess(resids ~ age,  data = data.frame(resids = residuals(lmAge),  age = cbWomenSub$age)) 
age20to80 = 20:80 
resid.lo.pr =  predict(resid.lo, newdata = data.frame(age = age20to80))

#plot of loess curve
smoothScatter(x = cbWomenSub$age, y = lmAge$residuals,  
              xlab = "Age (years)", 
              ylab = "Residuals",  
              main="Residual Plot from a Simple Linear Model of Performance to Age")  
abline(h = 0, col = "purple", lwd = 3) 
lines(x = age20to80, y = resid.lo.pr,  col = "green", lwd = 3, lty = 2)

Figure 8: Residual Plot from Fitting a Simple Linear Model of Performance to Age for women. The purple lines is a marking y = 0. The green dashed curve is a local smooth of the residuals.</h3>

The curve upward in residuals starting at about age 50 tell us a that simple linear model tends to underestimate the run time for older women.¶

Loess is used again to try to achieve a better fit, but not on the residuals this time. The loess model takes locally weighted averages of time as the ages varies from 20 to 80 to produce a non-parametric smooth curve.¶

A piecewise linear model is also fit. Instead of a curved line, the piecewise model will consist of linear segments between hinge points at age 30, 40, 50 and 60. Following Nolan and Lang's example with the mens data, variables for year groups are created (over30, over40, over50, etc.)¶

#loess model
womenRes.lo = loess(runTime ~ age, cbWomenSub) 
womenRes.lo.pr = predict(womenRes.lo, data.frame(age = age20to80)) 

#piecewise linear model
#creating variables for age bins
decades = seq(30, 60, by = 10)
overAge = lapply(decades, 
                 function(x) pmax(0, (cbWomenSub$age - x)))
names(overAge) = paste("over", decades, sep = "")
overAge = as.data.frame(overAge)

#least squares fit
lmPiecewise = lm(runTime ~ . , 
                 data = cbind(cbWomenSub[, c("runTime", "age")], 
                              overAge))

summary(lmPiecewise)
#create a data frame of the covariates to plot
overAge20 = lapply(decades, function(x) pmax(0, (age20to80 - x)))
names(overAge20) = paste("over", decades, sep = "")
overAgeDF = cbind(age = data.frame(age = age20to80), overAge20)

predPiecewise = predict(lmPiecewise, overAgeDF)

Call:
lm(formula = runTime ~ ., data = cbind(cbWomenSub[, c("runTime", 
    "age")], overAge))

Residuals:
    Min      1Q  Median      3Q     Max 
-46.020  -9.435  -0.667   8.643  79.517 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 87.44962    0.67289 129.962   <2e-16 ***
age          0.35778    0.02490  14.370   <2e-16 ***
over30      -0.39441    0.03898 -10.118   <2e-16 ***
over40       0.38498    0.04218   9.127   <2e-16 ***
over50       0.10108    0.06885   1.468    0.142    
over60      -0.02632    0.13087  -0.201    0.841    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.08 on 75817 degrees of freedom
Multiple R-squared:  0.0197,	Adjusted R-squared:  0.01963 
F-statistic: 304.7 on 5 and 75817 DF,  p-value: < 2.2e-16

The piecewise linear model indicates significant relationships for each year group except for over 50 and over 60. The results indicate a slow down in run times as female runners age, except for between the ages of 30 and 40 years. Those runners between 30 and 40 actually decrease their run time by 0.3944 minutes more per year over 30, until age 40 when the run times begin to increase again.¶

Next the loess curve and piecewise models are plotted together for comparison.¶

#plot piecewise segments
plot(predPiecewise ~ age20to80,
     type = "l", col = "purple", lwd = 3,
     xlab = "Age (years)", ylab = "Run Time Prediction", main = "Piecewise Linear & Loess Curves for Run Time vs. Age for Women")
#plot loess curve
lines(x = age20to80, y = womenRes.lo.pr, 
      col = "green", lty = 2, lwd = 3)
legend("topleft", col = c("purple", "green"),
       lty = c(1, 2), lwd= 3,
       legend = c("Piecewise Linear", "Loess Curve"), bty = "n")

dev.off()

Figure 9: Piecewise Linear and Loess Curves Fitted to Run Time vs. Age for Women</h3>

The two curves follow each other quite closely, though the linear model can not capture the curvature that the loess model does, especially in the older ages.¶

Similar to the results Nolan and Lang found for men, the hinges of the piecewise model for women make it clear that the fastest runners are between the ages of 20 and 30 with run time increasing as the runner approaches 30. The run times between 30 and 40 decrease slightly, then increase greatly between 40 and 50, and even more sharply after 50.¶

Back to Top

Conclusion¶

Working with data from the websites can be bountiful, but also difficult. The challenges of navigating the often uneven format of many websites can be a real obstacle to an analyst. The Nolan and Lang text demonstrates a useful method for navigating the multiple years of race data available at the Cherry Blossom Race website in an easy to follow and easily modifiable format. Like the men’s data, the women’s data needed to be carefully extracted and re-formatted to be useful. Over the fourteen years of race results, there were multiple formats for recording the runner's data, several spacing errors and inconsistent variable names. Once these were overcome, useful analysis could be made on the relationship between run time and age for female runners. And this analysis is just scratching the surface, much more information could be gleaned from the Cherry Blossom Race results in the future.¶

References¶

Back to Top

Nolan, Deborah. Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series) (p. 101). CRC Press. Kindle Edition.
https://blogs.sas.com/content/iml/2016/10/17/what-is-loess-regression.html

Quantifying the World¶

Case Study 4 - Cherry Blossom Race¶

Question 7¶

Stacy Conant¶

Feb 10, 2020¶

¶

Contents¶

Introduction¶

Data¶

Method¶

Webscraping¶

Next, the extracResTable() function is applied to each of the women’s URLs with mapply(). The count of rows for each year is also output. Except for 2000, there is a steady increase in the number of female runners each year.¶

Next, a file directory is created to house a text file for each year of women's race results. The write() function writes each year.¶

Visual inspection of the text files shows that 2001 does not have a header for the columns. The format of the 2002 file is correct and contains the same columns, so its header can be applied to 2001.¶

Next, it is a good idea to take a look at the first few rows of data to get better understanding of the formatting. Below, is the example from 2010, but each year has been reviewed. Some years include different columns for race times (net time vs time), or the addition of gun time and splits.¶

Next, the extractVariables() function is used to select the variables of interest, which may vary from year to year, as seen from earlier visual inspection.¶

Cleaning Age¶

Now that the data is extracted, one can begin looking at variables of interest. To assess runner age, it must first be converted from a character data type to numeric. Then it can be displayed in a boxplot by year (Fig. 1).¶

It is obvious that there an issue with the ages in 2003. The ages are skewed very young. Upon visual inspection, it is clear that the age values are shifted to the right one space in comparison to the location of the ‘=’ characters, so it is only reading one digit of the two digit age.¶

To fix this issue, modifications are made to the selectCols() function that locates the end of a column to include the blank position.¶

The boxplot of female runner age is then re-run to check that the modifications applied have been successful (Fig.2).¶

But it seems that in 2001, there was one runner whose age was reported as zero. Which() is used to find the runner with age equal to zero. She is found to be runner in finishing place 2611. To locate her row, three is added to the place to account for the header.¶

The age for Loretta Cuce is manually imputed in to the 2001.txt file and code is re-run.¶

Loretta Cuce's age is now correctly reported as 55. Re-running boxplot code to check that the 0 age has been corrected (Fig. 3).¶

Cleaning Run Time¶

Then we create a function, createDF, that determines which time to use (net, gun, or time), handles the footnote symbols (# and * ) that are in some time records, drops records of runners who did not finish the race, and exports the results to a data frame.¶

After that data has been examined, cleaned and formatted, all years can be saved to cbWomen using do.call() function to call rbind() with the list of data frames as input.¶

Analysis¶

Now it is possible to study the relationship between age and run time for women racers between 1999 and 2012 by running a scatter plot for age vs run time (Fig. 4).¶

To better see and understand the relationship between age and run time, the color brewer library is loaded and a smoothscatter() plot is displayed (Fig. 5).¶

There is a age concentration of runners 20 - 30 years old and with run times concentrated between 80 to 120 minutes.¶

Further insight can be gleaned from a count of runners by age (fig. 6) and a breakout of run times by age category (fig. 7).¶

The majority of female runners are between 25 and 35 years old.¶

It appears that the fastest runner were between 15 and 25 years old.¶

Next, a linear model can be run to attempt to capture the relationship between female age and run time. The lm() function is used to fit run time as a function of age.¶

To find investigate the relationship, loess() is used to find a pattern in the residuals by taking in to account local weighted averages of age. LOESS can be useful because it employs a parametric curve to visualize trends in data that might be difficult to model linearly.¶

A vector of the age values from 20 to 80 years is then created and predicted values are obtained from running predict() on the loess fit.¶

The curve upward in residuals starting at about age 50 tell us a that simple linear model tends to underestimate the run time for older women.¶

Loess is used again to try to achieve a better fit, but not on the residuals this time. The loess model takes locally weighted averages of time as the ages varies from 20 to 80 to produce a non-parametric smooth curve.¶

A piecewise linear model is also fit. Instead of a curved line, the piecewise model will consist of linear segments between hinge points at age 30, 40, 50 and 60. Following Nolan and Lang's example with the mens data, variables for year groups are created (over30, over40, over50, etc.)¶

Next the loess curve and piecewise models are plotted together for comparison.¶

The two curves follow each other quite closely, though the linear model can not capture the curvature that the loess model does, especially in the older ages.¶

Conclusion¶

References¶

Next, the `extracResTable()` function is applied to each of the women’s URLs with `mapply()`. The count of rows for each year is also output. Except for 2000, there is a steady increase in the number of female runners each year.¶

Next, a file directory is created to house a text file for each year of women's race results. The `write()` function writes each year.¶

To fix this issue, modifications are made to the `selectCols()` function that locates the end of a column to include the blank position.¶

But it seems that in 2001, there was one runner whose age was reported as zero. `Which()` is used to find the runner with age equal to zero. She is found to be runner in finishing place 2611. To locate her row, three is added to the place to account for the header.¶

Then we create a function, `createDF`, that determines which time to use (net, gun, or time), handles the footnote symbols (# and * ) that are in some time records, drops records of runners who did not finish the race, and exports the results to a data frame.¶

After that data has been examined, cleaned and formatted, all years can be saved to `cbWomen` using `do.call()` function to call `rbind()` with the list of data frames as input.¶

To better see and understand the relationship between age and run time, the color brewer library is loaded and a `smoothscatter()` plot is displayed (Fig. 5).¶

Next, a linear model can be run to attempt to capture the relationship between female age and run time. The `lm()` function is used to fit run time as a function of age.¶

To find investigate the relationship, `loess()` is used to find a pattern in the residuals by taking in to account local weighted averages of age. LOESS can be useful because it employs a parametric curve to visualize trends in data that might be difficult to model linearly.¶

A vector of the age values from 20 to 80 years is then created and predicted values are obtained from running `predict()` on the loess fit.¶