Quantifying the World

Case Study 8 - Time Series - Stock Data

Stacy Conant

March 2, 2020

Contents

Introduction
Data
Method
Conclusions
References

Introduction

ARIMA, or ‘AutoRegressive Integrated Moving Average’, is a forecasting method based on the idea that the past values of a time series alone can be used to predict its future values. Economic data, such as stock prices, are notoriously tricky to analyze and forecast because many factors influence the economy. In this case study, two years of British Petroleum (BP) closing stock price data are analyzed, first using ARIMA identification rules and AIC to choose the best-fitting model, and then using a grid search. The forecasts from each model are then compared using the average square error (ASE). R is used for this analysis, especially the tswge and fpp2 packages.

Data

#load necessary libraries
library(quantmod)
options("getSymbols.warning4.0"=FALSE)
library(tswge)
library(ggplot2)
options(warn=-1)
library(tseries)
library(fpp2)

To pull the data, the `getSymbols` function from the `quantmod` package is used. For this analysis, the closing stock prices for BP are used. The time range is set for two years, February 2018 to February 2020.

#quantmod package - pull data
getSymbols("BP",from = "2018-2-1", to = "2020-2-3", auto.assign=TRUE)

#Summary of data
summary(BP)

     Index               BP.Open         BP.High          BP.Low     
 Min.   :2018-02-01   Min.   :35.92   Min.   :36.09   Min.   :35.73  
 1st Qu.:2018-08-01   1st Qu.:39.05   1st Qu.:39.23   1st Qu.:38.70  
 Median :2019-02-01   Median :41.28   Median :41.47   Median :40.96  
 Mean   :2019-01-31   Mean   :41.38   Mean   :41.61   Mean   :41.08  
 3rd Qu.:2019-08-01   3rd Qu.:43.71   3rd Qu.:43.99   3rd Qu.:43.49  
 Max.   :2020-01-31   Max.   :47.38   Max.   :47.83   Max.   :47.38  
    BP.Close       BP.Volume         BP.Adjusted   
 Min.   :36.05   Min.   : 2348400   Min.   :33.88  
 1st Qu.:38.95   1st Qu.: 4712150   1st Qu.:36.81  
 Median :41.22   Median : 5975500   Median :38.47  
 Mean   :41.33   Mean   : 6525950   Mean   :38.45  
 3rd Qu.:43.80   3rd Qu.: 7845700   3rd Qu.:40.03  
 Max.   :47.79   Max.   :19987200   Max.   :42.95

#get number of values/days
close = BP$BP.Close
nrow(close)

The data pulled by `quantmod` includes open, high, low, and adjusted prices as well as volume, but for this study only the closing price will be analyzed. The mean closing price is \$41.33 and there are 503 observations.

Method

To analyze this stock data, the time series should be stationary. Stationarity has three conditions:

1. Subpopulations of $X_t$ have the same mean for each $t$. Restated, the mean does not depend on time ($t$).
2. Subpopulations of $X_t$ for a given time have a finite and constant variance for all $t$. Restated, the variance does not depend on time.
3. The correlation of $X_{t_1}$ and $X_{t_2}$ depends only on $t_2 - t_1$. That is, the correlation between data points depends only on how far apart they are, not on where they are in time.

Stationary data should be a "flat looking" series, without trend, with constant variance over time, a constant autocorrelation structure over time and no periodic seasonality. There are tools in R that can assist in assessing the stationarity of a time series.
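To make the constant-variance condition concrete, the following is a minimal sketch on simulated data (not the BP series). It compares the variance across many simulated realizations at an early and a late time point, for white noise and for a random walk. The seed, the number of realizations and the chosen time points are all assumptions made for illustration.

```r
# Sketch with simulated data (not the BP series): compare how the variance
# across many realizations changes with time for white noise vs. a random walk.
set.seed(1)                      # arbitrary seed, for reproducibility only
n_rep <- 500                     # number of simulated realizations (assumed)
n_obs <- 200                     # length of each realization (assumed)

# each column is one realization
noise <- matrix(rnorm(n_rep * n_obs), nrow = n_obs)
walk  <- apply(noise, 2, cumsum) # random walk = running sum of the noise

# variance across realizations at an early and a late time point
var_noise <- c(early = var(noise[10, ]), late = var(noise[200, ]))
var_walk  <- c(early = var(walk[10, ]),  late = var(walk[200, ]))

var_noise  # similar at both times: consistent with a constant variance
var_walk   # grows with t: the random walk violates the variance condition
```

A wandering series such as a random walk fails the variance condition even though each individual realization may look unremarkable, which is why formal checks are useful in addition to plots.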

For this analysis, the R package tswge will be primarily used to explore and model the data. First, the BP closing price time series will be plotted with plotts.wge().

#plot time series
plotts.wge(close)
title("Plot of Daily Closing Price for British Petroleum, Feb 2018 - Feb 2020")

Figure 1: Plot of Daily Closing Price for British Petroleum. It appears to have a wandering pattern with a few larger jumps in price.

Next, looking at the spectral density can give clues about the frequencies that may be present in this time series. The parzen.wge() function calculates and plots the smoothed periodogram using the Parzen window. Before the plot, it outputs the frequencies at which the smoothed periodogram is calculated (freq) and the smoothed periodogram point using the Parzen window (pzgram).

parzen.wge(close, trunc=100)

Figure 2: Parzen plot of Spectral Density.

The spectral density in the Parzen plot peaks at frequency 0, which indicates that this is indeed a wandering time series with no discernible cyclic or seasonal behavior.
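As a rough illustration of how a spectral estimate points to the frequencies present in a series, the sketch below computes a raw (unsmoothed) periodogram of a simulated noisy cosine with base R's `fft()`. This is not the Parzen-window estimate that `parzen.wge()` produces; the simulated series, its frequency `f0` and the noise level are assumptions chosen for the example.

```r
# Sketch: raw (unsmoothed) periodogram of a simulated noisy cosine.
set.seed(3)
n  <- 200
t  <- 1:n
f0 <- 0.1                                   # assumed true frequency (cycles per observation)
x  <- cos(2 * pi * f0 * t) + rnorm(n, sd = 0.5)

# periodogram ordinates at the Fourier frequencies j/n, j = 1, ..., n/2
dft   <- fft(x - mean(x))
freq  <- (1:(n / 2)) / n
pgram <- (Mod(dft)^2 / n)[2:(n / 2 + 1)]    # drop the j = 0 term

# frequency at which the periodogram is largest
peak_freq <- freq[which.max(pgram)]
peak_freq
```

For a series with a clear cycle the peak sits near the cycle's frequency, whereas a wandering series like the BP closing price concentrates its power near frequency 0.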

The next task is to check the ACF and PACF. The ACF, or auto-correlation function, gives values of auto-correlation of any series with its lagged values. It is a graphic representation that helps to describe how well the present value of the series is related with its past values.

The PACF, or partial auto-correlation function, gives the correlation between the series and a given lag after the effect of the shorter lags in between has been removed; hence ‘partial’ and not ‘complete’, as it removes the variation already explained before it finds the next correlation.
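The definition of the sample autocorrelation can be checked directly. The sketch below, on a simulated AR(1) series (an assumed example, not the BP data), computes the autocorrelation at lags 1 to 5 by hand and compares it with base R's `acf()`; the helper name `manual_acf` is hypothetical.

```r
# Sketch with simulated AR(1) data (assumed example, not the BP series).
set.seed(42)
x <- as.numeric(arima.sim(model = list(ar = 0.8), n = 300))

# sample autocorrelation at lag k, computed directly from its definition
manual_acf <- function(x, k) {
  n    <- length(x)
  xbar <- mean(x)
  sum((x[1:(n - k)] - xbar) * (x[(k + 1):n] - xbar)) / sum((x - xbar)^2)
}

r_manual  <- sapply(1:5, function(k) manual_acf(x, k))
r_builtin <- acf(x, lag.max = 5, plot = FALSE)$acf[2:6]  # element 1 is lag 0

# partial autocorrelations for the same lags
phi_kk <- as.numeric(pacf(x, lag.max = 5, plot = FALSE)$acf)

cbind(lag = 1:5, r_manual, r_builtin, pacf = phi_kk)
```

At lag 1 the partial and ordinary autocorrelations coincide, because there are no intermediate lags to adjust for.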

acf(close)

Figure 3: ACF for Daily BP Closing Stock Price.

This ACF is slowly damping to zero, even out past 25 lags. This could indicate that future values of the series are heavily correlated with past values and that the stock prices are not stationary.

Increasing the maximum lag in the function displays where the autocorrelations begin to approach zero.

acf(close, lag.max = 105)

Figure 4: ACF to lag 105 for Daily BP Closing Stock Price.

There is strong positive correlation up to about lag 80, which is the point where the ACF plot crosses the upper confidence threshold.

pacf(close)

Figure 5: PACF for Daily BP Closing Stock Price.

The PACF displays the lag crossing the 95% limit line and approaching zero at lag 2. This could indicate an AR(2) process in this data.

One test for stationarity is the Augmented Dickey-Fuller (ADF) test, which checks whether the series has a unit root (a non-stationary series, such as one with a stochastic trend, will have a unit root and result in a large p-value). The adf.test() function from the tseries package will be used.

#test for stationarity
adf.test(close)

	Augmented Dickey-Fuller Test

data:  close
Dickey-Fuller = -2.694, Lag order = 7, p-value = 0.2846
alternative hypothesis: stationary

The Dickey-Fuller test fails to reject the null hypothesis of a unit root (p-value = 0.2846), so there is no evidence that the BP stock data is stationary.

Differenced Data

To attempt to stationarize the data, one difference can be taken. Differencing can help stabilise the mean of a time series by computing the differences between consecutive observations to eliminate or reduce any trend and seasonality. To do this, tswge has the artrans.wge() function.
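As a sketch of what a first difference does, the example below applies base R's `diff()` to a simulated random walk (an assumed stand-in for a price series, not the BP data) and shows that differencing recovers the underlying shocks.

```r
# Sketch on a simulated random walk (assumed example, not the BP series).
set.seed(7)
e <- rnorm(250)        # white-noise shocks
x <- cumsum(e)         # random walk: x[t] = x[t-1] + e[t]

# first difference: y[t] = x[t] - x[t-1]
y_manual <- x[-1] - x[-length(x)]
y_diff   <- diff(x)    # base R equivalent, lag = 1, differences = 1

# differencing the random walk recovers the shocks (apart from the first one)
head(cbind(y_manual, y_diff, shock = e[-1]))

# lag-1 autocorrelation before and after differencing
acf(x, plot = FALSE)$acf[2]
acf(y_diff, plot = FALSE)$acf[2]
```

Note that one observation is lost for each difference taken, so a series of length 250 yields 249 differenced values.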

#take 1 difference with artrans
close.dif = artrans.wge(close, 1)

Figure 6: Plots of original time series and ACF (top) and differenced time series and ACF (bottom).

The transformed time series certainly appears more stationary. The ACF for the transformed data now resembles that of a white noise series. Another Dickey-Fuller test can assess the stationarity of the differenced data.

#stationarity test of differenced data
adf.test(close.dif)

	Augmented Dickey-Fuller Test

data:  close.dif
Dickey-Fuller = -7.9607, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary

The test for stationarity on the differenced data rejects the null and indicates that the differenced BP stock data is stationary with a p-value of 0.01.

Another useful check is the Ljung-Box test for independence. The Ljung-Box test examines whether there is significant evidence of non-zero correlations at the given lags, with the null hypothesis that the series is independent (white noise); a series with remaining serial correlation, such as a non-stationary one, will have a low p-value. The ljung.wge() function of tswge is used for this test.

ljung.wge(close.dif)

Obs -0.07141393 -0.008923918 0.04868184 0.05578057 -0.08628334 0.02453343 -0.01434213 0.002765201 -0.008621435 -0.006807761 0.05791577 -0.01510345 -0.04470057 0.04159326 0.002429646 0.02486328 -0.01619557 0.006486719 -0.02650535 0.01441494 0.01414213 0.03533125 -0.02384849 0.01602258

The test on the differenced time series returns a large p-value, indicating that the differenced series is consistent with white noise and therefore stationary.
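For readers who want to see what the Ljung-Box statistic computes, the sketch below evaluates $Q = n(n+2)\sum_{k=1}^{K} r_k^2/(n-k)$ by hand on simulated white noise and compares it with base R's `Box.test()`. The simulated series and the choice of $K = 24$ lags are assumptions for the example; it does not reproduce the tswge output above.

```r
# Sketch on simulated white noise (assumed example, not the BP series).
set.seed(123)
w <- rnorm(500)
K <- 24                                   # number of lags tested (assumed choice)
n <- length(w)

# sample autocorrelations at lags 1..K
r <- acf(w, lag.max = K, plot = FALSE)$acf[-1]

# Ljung-Box statistic: Q = n (n + 2) * sum_{k=1}^{K} r_k^2 / (n - k)
Q_manual <- n * (n + 2) * sum(r^2 / (n - 1:K))
p_manual <- 1 - pchisq(Q_manual, df = K)  # chi-square with K df under the null

# same test via base R
lb <- Box.test(w, lag = K, type = "Ljung-Box")

c(Q_manual = Q_manual, Q_builtin = unname(lb$statistic))
c(p_manual = p_manual, p_builtin = lb$p.value)
```

When the test is applied to model residuals rather than raw data, the degrees of freedom are usually reduced by the number of fitted ARMA parameters (the `fitdf` argument of `Box.test()`).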

Manual Model Search

Next, the ACF and PACF of the differenced data can be assessed to attempt to determine the $p$ and $q$ terms for the ARIMA model. The acf() and pacf() functions are used for this.

#plot acf of differenced data
acf(close.dif)

Figure 7: Plot of the differenced time series ACF.

The ACF can help to determine the value of $q$ for the MA process. All of the lags are within the 95% confidence limits. This could mean that 0 is an appropriate MA term, but an MA term of 1 or 2 is also plausible. Rule 7 indicates that a negative lag-1 autocorrelation, as shown in Fig. 7, could mean that the series is slightly "overdifferenced" and that an MA term should be added to the model.

#plot pacf of differenced data
pacf(close.dif)

Figure 8: Plot of the differenced time series PACF.

The PACF can be assessed for the AR term, or $p$. Here again, all of the lags are within the 95% limit lines. Conservatively, 0 could be used as the AR term, but rule 6 indicates that 1 or 2 may be more likely to give a better model.

Several models can be tried and judged in terms of AIC score. The Akaike Information Criterion (AIC) is an estimator of overall model quality and is widely used for model selection. It measures the relative information lost by a given model: the less information a model loses, the better its quality, so lower AIC values are preferred.

$AIC = 2k - 2\ln(\widehat{L})$

where $k$ is the number of estimated parameters and $\widehat{L}$ is the maximized value of the likelihood function.
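The AIC formula can be verified numerically. The sketch below fits an ARMA(1,1) to simulated data with base R's `arima()` (a stand-in chosen for illustration, not the tswge estimator used in this study) and rebuilds the AIC from the log-likelihood. The value stored in `$aic` by `est.arma.wge()` may be on a different scale, so AIC values should only be compared between models fitted with the same function.

```r
# Sketch on simulated ARMA(1,1) data (assumed example, not the BP series).
set.seed(2020)
x <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 400)

# fit an ARMA(1,1) by maximum likelihood, without a mean term
fit <- arima(x, order = c(1, 0, 1), include.mean = FALSE, method = "ML")

k       <- length(coef(fit)) + 1   # AR and MA coefficients plus the noise variance
log_lik <- fit$loglik              # maximized log-likelihood, ln(L-hat)

# AIC = 2k - 2 ln(L-hat)
aic_manual <- 2 * k - 2 * log_lik
c(manual = aic_manual, arima = fit$aic, generic = AIC(fit))
```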

The function est.arma.wge() is used to calculate maximum likelihood estimates of parameters of stationary models. The $p$ and $q$ terms are plugged in to the function along with the differenced data. The AIC can be printed for the model.

#model 1 - AR(2) and MA(2)
dif.est1 = est.arma.wge(close.dif, p=2, q=2)
#print AIC
dif.est1$aic

Coefficients of Original polynomial:  
-0.1317 -0.8183 

Factor                 Roots                Abs Recip    System Freq 
1+0.1317B+0.8183B^2   -0.0805+-1.1025i      0.9046       0.2616

#model 2 - AR(2) and MA(1)
dif.est2 = est.arma.wge(close.dif, p=2, q=1)
#print AIC
dif.est2$aic

Coefficients of Original polynomial:  
-0.3278 -0.0411 

Factor                 Roots                Abs Recip    System Freq 
1+0.3278B+0.0411B^2   -3.9830+-2.9051i      0.2028       0.3997

#model 3 - AR(1) and MA(2)
dif.est3 = est.arma.wge(close.dif, p=1, q=2)
#print AIC
dif.est3$aic

Coefficients of Original polynomial:  
0.5969 

Factor                 Roots                Abs Recip    System Freq 
1-0.5969B              1.6753               0.5969       0.0000

The lowest AIC belongs to the first model, an ARIMA(2,1,2).

To assess this model's appropriateness more fully, the ACF of the residuals of the estimate can be viewed and tested for whiteness using the Ljung-Box test again.

#print pval of the Ljung-Box test of the residuals
ljung.wge(dif.est1$res)$pval
#view ACF
acf(dif.est1$res)

Obs -0.001238373 0.02357949 -0.01260141 0.03958959 -0.03265358 0.02737353 -0.0591823 0.0009563824 0.02623148 -0.004441566 0.02961887 -0.01142664 -0.0234686 0.03268989 -0.009071479 0.03756145 -0.004747382 -0.008058259 -0.03768635 0.0256482 0.02613063 0.03129668 -0.0295573 0.01725465

Figure 9: Plot of the residuals of the ARIMA(2,1,2) model.

The ACF of the residuals shows the lags staying completely within the 95% limit lines, and the Ljung-Box test returns a large p-value. Both indicate that the model does not exhibit any significant lack of fit, so forecasting can be performed.

Calling dif.est1 displays the data output by the estimate, including phi and theta terms.

#display estimate data
dif.est1

Next, the estimates from model 1, ARIMA(2,1,2), are used to create a forecast. In this case, the phi, theta and difference ($d$) terms are included. The forecast covers 30 days and begins 30 days before the end of the series so that it can be compared against the actual values.

#forecast in tswge for model 1 - ARIMA(2,1,2)
dif.fore1 = fore.aruma.wge(close, phi = dif.est1$phi, theta = dif.est1$theta, d = 1, n.ahead = 30, lastn = T, limits = T)
title("Time Series & Forecast for Last 30 Days of BP Closing with ARIMA (2,1,2)")

Figure 10: Plot of Forecasts for Last 30 Days of BP Closing Stock Price with ARIMA (2,1,2).

The forecast does not track the actual closing prices well.

#Display of forecasts
dif.fore1$f

#close up the forecast
plot(seq(474,503,1), close[474:503],type = 'l', xlim = c(474,503))
lines(seq(474,503), dif.fore1$f, col = 'blue')
title("Plot of Forecasts for Last 30 Days of BP Closing with ARIMA (2,1,2)")

Figure 11: Close-up Plot of Forecasts for Last 30 Days of BP Closing Stock Price with ARIMA (2,1,2). The forecast does not track the actual closing prices well.

Forecasting errors can be evaluated in terms of Average Square Error (ASE). Lower ASE models are preferred.

$ASE = \text{mean}\left((\text{Forecasted Value} - \text{Observed Actual Value})^2\right)$

#calculate ASE for model 1 - ARIMA(2,1,2)
ASE1 = mean((dif.fore1$f - close[(length(close) - 29):length(close)])^2)
ASE1
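The same holdout idea can be sketched with base R only. In the example below, a simulated random walk stands in for a price series, and the model order ARIMA(1,1,0) is a hypothetical choice made purely for illustration. Unlike the tswge call above, which uses coefficients estimated on the full series, this sketch refits the model on the training portion only, so the held-out values play no part in estimation.

```r
# Sketch on a simulated series (assumed example, not the BP series).
set.seed(99)
x <- cumsum(rnorm(300))            # random walk standing in for a price series
h <- 30                            # forecast horizon / size of the holdout

train  <- x[1:(length(x) - h)]     # fit only on data before the holdout
actual <- x[(length(x) - h + 1):length(x)]

# hypothetical model choice for illustration: ARIMA(1,1,0)
fit  <- arima(train, order = c(1, 1, 0))
pred <- as.numeric(predict(fit, n.ahead = h)$pred)

# average squared error over the holdout
ase <- mean((pred - actual)^2)
ase
```

Because ASE is measured in squared price units, it is only meaningful when comparing forecasts of the same series over the same window.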

Grid Search

Next, a grid search function can be used to attempt to find the best-fitting model for the BP stock data.

The auto.arima() function, from the forecast package that is loaded with fpp2, uses a variation of the Hyndman-Khandakar algorithm, which combines unit root tests, minimization of the AICc and MLE to obtain an ARIMA model.

1. The number of differences $0 \leq d \leq 2$ is determined using repeated KPSS tests.
2. The values of $p$ and $q$ are then chosen by minimising the AICc after differencing the data $d$ times. Rather than considering every possible combination of $p$ and $q$, the algorithm uses a stepwise search to traverse the model space.

   a. Four initial models are fitted:
      - ARIMA(0,$d$,0),
      - ARIMA(2,$d$,2),
      - ARIMA(1,$d$,0),
      - ARIMA(0,$d$,1).

      A constant is included unless $d = 2$. If $d \leq 1$, an additional model is also fitted: ARIMA(0,$d$,0) without a constant.

   b. The best model (with the smallest AICc value) fitted in step (a) is set to be the “current model”.

   c. Variations on the current model are considered:
      - vary $p$ and/or $q$ from the current model by $\pm 1$;
      - include/exclude $c$ from the current model.

      The best model considered so far (either the current model or one of these variations) becomes the new current model.

   d. Repeat Step 2(c) until no lower AICc can be found.
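The idea of ranking candidate orders by AICc can be sketched with base R alone. The example below is an exhaustive search over a small grid, not the stepwise Hyndman-Khandakar procedure described above, and it fixes $d$ by hand instead of using KPSS tests. The simulated series, the search limits and the helper name `aicc` are all assumptions for illustration.

```r
# Sketch of an exhaustive (not stepwise) search on simulated data.
# Assumptions: the series, the search limits and the helper name are illustrative.
set.seed(11)
x <- cumsum(arima.sim(model = list(ar = 0.5), n = 300))  # integrated AR(1) series
d <- 1                                                    # differencing order fixed by hand

# AICc = AIC + 2k(k + 1) / (n - k - 1), with k parameters and n usable observations
aicc <- function(fit) {
  k <- length(coef(fit)) + 1
  n <- fit$nobs
  fit$aic + 2 * k * (k + 1) / (n - k - 1)
}

grid <- expand.grid(p = 0:2, q = 0:2)
grid$AICc <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- tryCatch(
    arima(x, order = c(grid$p[i], d, grid$q[i]), method = "ML"),
    error = function(e) NULL      # skip orders where estimation fails
  )
  if (!is.null(fit)) grid$AICc[i] <- aicc(fit)
}

# candidate with the smallest AICc among those that could be fitted
best <- grid[which.min(grid$AICc), ]
best
```

An exhaustive grid becomes expensive as the limits on $p$ and $q$ grow, which is the motivation for the stepwise traversal.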

For this grid search, the default maximum values of $p$, $q$ and $d$ will be increased so that more models are searched.

#grid search function of fpp2 package
auto.arima(close.dif, max.p = 8, max.q = 5, max.d=3)

Series: close.dif 
ARIMA(2,0,1) with zero mean 

Coefficients:
          ar1      ar2     ma1
      -0.3148  -0.0400  0.2388
s.e.   0.5719   0.0595  0.5705

sigma^2 estimated as 0.2785:  log likelihood=-389.94
AIC=787.89   AICc=787.97   BIC=804.76

The grid search suggests an ARIMA(2,0,1) on the differenced data, which is equivalent to an ARIMA(2,1,1) on the original data.

Next, an estimate will be created using the suggested model, as was done for the manually selected model 1.
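The equivalence between fitting with $d = 0$ on differenced data and fitting with $d = 1$ on the original data can be checked on a simpler case. The sketch below uses base R's `arima()` on a simulated integrated AR(1) series; the model order and the data are assumptions chosen so that the comparison is easy to read, not the models from this study.

```r
# Sketch on simulated data (assumed example, not the BP series).
set.seed(5)
x <- cumsum(arima.sim(model = list(ar = 0.6), n = 500))  # integrated AR(1)

# (a) difference by hand, then fit a zero-mean AR(1) with d = 0
fit_diff <- arima(diff(x), order = c(1, 0, 0), include.mean = FALSE, method = "ML")

# (b) fit ARIMA(1,1,0) to the undifferenced series
fit_orig <- arima(x, order = c(1, 1, 0), method = "ML")

# the AR coefficient should agree closely between the two fits
c(differenced = unname(coef(fit_diff)), original = unname(coef(fit_orig)))
```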

#create estimate of suggested ARIMA(2,1,1) model
dif.est4 = est.arma.wge(close.dif, p=2, q=1)

Coefficients of Original polynomial:  
-0.3278 -0.0411 

Factor                 Roots                Abs Recip    System Freq 
1+0.3278B+0.0411B^2   -3.9830+-2.9051i      0.2028       0.3997

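#forecasting based on suggested grid search model ARIMA(2,1,1)
#note: theta = dif.est4$theta is not passed here, so this call uses only the AR(2) and difference terms
dif.fore4 = fore.aruma.wge(close, phi = dif.est4$phi, d = 1, n.ahead = 30, lastn = T, limits = T)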
title("Time Series and Forecast for Last 30 Days of BP Closing with ARIMA (2,1,1)")

Figure 12: Plot of Forecasts for Last 30 Days of BP Closing Stock Price with ARIMA (2,1,1) as suggested by grid search. The forecast does not track the actual closing prices well.

#close up the forecast
plot(seq(474,503,1), close[474:503],type = 'l', xlim = c(474,503))
lines(seq(474,503), dif.fore4$f, col = 'blue')
title("Plot of Forecasts for Last 30 Days of BP Closing with ARIMA (2,1,1)")

Figure 13: Close-up Plot of Forecasts for Last 30 Days of BP Closing Stock Price with ARIMA (2,1,1). The forecast does not track the actual closing prices well.

#average square error for (2,1,1)
ASE4= mean((dif.fore4$f - close[(length(close) - 29):length(close)])^2)
ASE4

Using `auto.arima()` together with `forecast()`, a plot of the next $n$ values can also be easily created.

#plot of future 30 days
plot(forecast(auto.arima(close, max.p = 8, max.q = 5, max.d=3),h=30))

Figure 14: Plot of Forecasts for Next 30 Days of BP Closing Stock Price with ARIMA (2,1,1). The forecast only seems to repeat the same value.

Summary of manual model search and grid search with `auto.arima()`.

The ASE for the grid-searched ARIMA(2,1,1) is slightly higher, indicating that the first model tried in the manual search, ARIMA(2,1,2), forecast the last 30 days of the BP stock data slightly better.

| Manual Selection | Grid Search |
|---|---|
| ARIMA(2,1,2) | ARIMA(2,1,1) |
| ASE = 0.8951 | ASE = 0.9671 |

Conclusion

Analyzing stock data is no easy task. The ARIMA models tried in this case study, both manually selected and grid searched, did not do an adequate job of forecasting the actual values of the BP closing prices. This may be because of an error in the modelling, such as a seasonal term that was not accounted for. Alternatively, it may reflect that economics is a complicated field with many other variables at work, such as the political environment, interest rates, and supply and demand. In the future, a multivariate model could be used that might better estimate the daily closing prices.

References
