Abstract
In this project, we compare the performance of two portfolios, both with an initial capital of $1,000,000: one is managed by monkeys, the other by humans. The former portfolio is constructed by randomized stock picks. The latter portfolio is constructed using a multi-factor valuation model adhering to Arbitrage Pricing Theory with a basic allocation model. Performance is assessed in terms of portfolio value at the end of the testing period.Data provided for free by IEX. View IEX’s Terms of Use.
Note that this service will be deprecated on June 1, 2019. To use a similar service, see IEX Cloud. Furthermore, consult its terms of service.
The objective of this tutorial is to provide a motivated project as an example of using R in a realistic application. This project seeks to answer a simple question: Who would be better at picking stocks: Monkeys or Humans? Is the stock market a game of skill or luck? Let’s find out!
Important Note: Consult the glossary for detailed information on most financial terms used in this project.
Our hypothesis is:
Humans outperform monkeys in portfolio management.
At first glance, our experiment appears to be nothing more than an attempt to answer a juvenile curiosity of little practical importance. Its actual implications, however, are quite significant.
The actual question is: does the application of quantitative financial theory improve the ability of portfolio managers to gain superior returns? If humans succeed over the monkeys, the answer is affirmative. The “monkey” portfolio is merely a proxy for a randomly constructed one. The specific financial theory to be tested is Arbitrage Pricing Theory, which makes use of multilinear regression to predict equity returns.
Yet the question remains: Is there any point in applying sophisticated data science to merely get money? From a business standpoint, the answer is clearly a “yes.” Asset management firms, be it for high-net worth clients or pension funds, will be able to increase economic efficiency by supplanting the work of many financial analysts with a few data scientists, thereby increasing employee and client returns. From a data-science standpoint, financial models returns can have their fundamental theory applied to other cases concerning stochastic forecasting, such as fluid dynamics, tumor growth etc. Indeed, physical models of Brownian motion have found use in finance.
This experiment is divided into three stages, which will be presented sequentially:
Note: A trading week is not 7 days (traders need weekends off too)! A trading week is typically 5 days. We extrapolate the number of trading days in a month and year similarly from average metrics.
As stated previously, the “monkey” portfolio is merely a proxy term for a randomly-constructed portfolio devoid of financial theory. The human portfolio applies Arbitrage Pricing Theory to model daily stock returns as a multilinear regression model.
Though there exist many metrics for assessing portfolio performance, we use total portfolio value for simplicity of computation and brevity of explanation.
Before we begin the construction of either portfolio, let’s load the libraries:
library(jsonlite)
library(httr)
library(tidyverse)
library(dplyr)
library(zoo)
library(RcppRoll)
library(tibble)
theme_set(theme_minimal())
We will be getting the stock data that we need using the Investors Exchange (IEX) API. You can read more about this dataset here: https://iextrading.com/developer/docs/ Let’s get stock data over the last 2 years for Apple to start.
#collect data
r <- GET("https://api.iextrading.com/1.0/stock/market/batch",
query = list(symbols = "aapl",
types = "chart",
range = "2y", last = "5")
)
To get the data, we will be using an HTTP request that will return data in a JSON format. To learn more about JSON, check out the wikipedia article here: https://en.wikipedia.org/wiki/JSON
Congrats! We have some data. However, as you may notice, there isn’t any way to conveniently use this data. We want to get the data into a more usable format. Let’s start by unpacking the JSON formatted data into a more readable format:
#upack data
r1 <- content(r)
rappl <- enframe(unlist(r1$AAPL$chart))
#pull out relevant data
rappl_dates <- filter(rappl, name=="date")
rappl_close <- filter(rappl, name=="close")
#put data into a single dataframe
df <- data.frame(rappl_dates[2], rappl_close[2])
df <- df %>%
dplyr::rename(date="value", aapl_close="value.1")
head(df)
## date aapl_close
## 1 2017-05-22 149.3592
## 2 2017-05-23 149.1749
## 3 2017-05-24 148.7288
## 4 2017-05-25 149.2428
## 5 2017-05-26 148.9906
## 6 2017-05-30 149.0488
You should get a dataframe that looks similar to this. Going forward we will need to do this process for multiple stocks and get more attributes for each, so lets put all of this into a function which we can call repeatedly.
#function to return dataframe of date, close, high, low, volume for ticker
iex_chart <- function(tick, range){
r <- GET("https://api.iextrading.com/1.0/stock/market/batch",
query = list(symbols = tick,
types = "chart",
range = range, last = "5")
)
r1 <- content(r)
rtick <- enframe(unlist(r1[[tick]]$chart))
rdates <- filter(rtick, name=="date")
rclose <- filter(rtick, name=="close")
rhigh <- filter(rtick, name=="high")
rlow <- filter(rtick, name=="low")
rvolume <- filter(rtick, name=="volume")
df <- data.frame(rdates[2], rclose[2], rhigh[2], rlow[2], rvolume[2])
df <- df %>%
dplyr::rename(date="value", close="value.1", high="value.2", low="value.3", volume="value.4")
return(df)
}
#this funciton can be called with just the stock ticker and the timeframe
head(iex_chart("AAPL","2y"))
## date close high low volume
## 1 2017-05-22 149.3592 149.9315 148.3117 22966437
## 2 2017-05-23 149.1749 150.2418 148.6997 19918871
## 3 2017-05-24 148.7288 149.5338 148.0789 19219154
## 4 2017-05-25 149.2428 149.7084 148.4281 19235598
## 5 2017-05-26 148.9906 149.6017 148.6997 21927637
## 6 2017-05-30 149.0488 149.786 148.7191 20126851
Now time to set up our experiment. We will need a dataframe over which our monkeys can make random stock picks and see how their investments pan out.
First, let’s get a bunch of data. We used the method above to get data for 6 stocks over 5 years and saved it as a CSV earlier, so now lets just read that data in. We are going to put all this data into a single dataframe, so let’s also rename the columns so we can tell the data apart.
#Prepare monkey data
aapl_df <- read_csv("./data/aapl.csv") %>%
dplyr::rename(aapl_close="close", aapl_high="high", aapl_low="low",
aapl_vol="volume", aapl_change="change")
cone_df <- read_csv("./data/cone.csv") %>%
dplyr::rename(cone_close="close", cone_high="high", cone_low="low",
cone_vol="volume", cone_change="change")
fb_df <- read_csv("./data/fb.csv") %>%
dplyr::rename(fb_close="close", fb_high="high", fb_low="low",
fb_vol="volume", fb_change="change")
goog_df <- read_csv("./data/goog.csv") %>%
dplyr::rename(goog_close="close", goog_high="high", goog_low="low",
goog_vol="volume", goog_change="change")
govt_df <- read_csv("./data/govt.csv") %>%
dplyr::rename(govt_close="close", govt_high="high", govt_low="low",
govt_vol="volume", govt_change="change")
vz_df <- read_csv("./data/vz.csv") %>%
dplyr::rename(vz_close="close", vz_high="high", vz_low="low",
vz_vol="volume", vz_change="change")
#join into one data frame
df <- plyr::join_all(list(aapl_df,cone_df,fb_df,
goog_df,govt_df,vz_df),
by='date', type='left')
head(df)
## date aapl_close aapl_high aapl_low aapl_vol aapl_change cone_close
## 1 2014-05-12 77.6204 77.7290 76.9094 53324677 0.954164 18.4400
## 2 2014-05-13 77.7421 77.8443 77.3415 39934594 0.121767 18.5244
## 3 2014-05-14 77.7565 78.2187 77.4777 41600846 0.014404 18.4907
## 4 2014-05-15 77.0953 78.1140 76.9932 57711731 -0.661206 18.4653
## 5 2014-05-16 78.2331 78.2358 76.6475 69091834 1.137800 18.6512
## 6 2014-05-19 79.1601 79.5189 78.2096 79439024 0.926995 18.5413
## cone_high cone_low cone_vol cone_change fb_close fb_high fb_low fb_vol
## 1 18.6680 18.2288 130131 0.211179 59.8300 59.90 57.98 48575487
## 2 18.7103 18.3639 157313 0.084472 59.8300 60.89 59.51 48525476
## 3 18.6765 18.2879 104227 -0.033789 59.2300 60.45 58.95 47428583
## 4 18.7035 18.0852 167331 -0.025342 57.9190 59.38 57.52 56813940
## 5 18.7694 18.1866 239721 0.185834 58.0199 58.45 57.31 47933075
## 6 18.9046 18.4653 106110 -0.109812 59.2100 59.56 57.57 43033925
## fb_change goog_close goog_high goog_low goog_vol goog_change govt_close
## 1 2.5900 529.92 530.19 519.010 1908392 11.19 22.8782
## 2 0.0000 533.09 536.07 529.510 1648907 3.17 22.9059
## 3 -0.6000 526.65 533.00 525.290 1191863 -6.44 23.0077
## 4 -1.3110 519.98 525.87 517.420 1703758 -6.67 22.9893
## 5 0.1009 520.63 521.80 515.440 1481688 0.65 23.0051
## 6 1.1901 528.86 529.78 517.583 1276362 8.23 22.9523
## govt_high govt_low govt_vol govt_change vz_close vz_high vz_low
## 1 23.0727 22.8541 54908 -0.018528 38.3433 38.7175 38.2318
## 2 23.0449 22.8874 26060 0.027787 38.0965 38.4866 37.9293
## 3 23.1097 22.8444 29830 0.101793 38.2239 38.4707 38.0567
## 4 23.1005 22.9615 36025 -0.018429 38.1841 38.4070 38.0885
## 5 23.0355 22.9800 52543 0.015743 39.0678 39.2509 38.6458
## 6 23.1097 22.9523 62702 -0.052792 39.1315 39.4102 39.0599
## vz_vol vz_change
## 1 13582378 -0.254774
## 2 18017013 -0.246811
## 3 13875195 0.127387
## 4 13115429 -0.039807
## 5 29389728 0.883740
## 6 19062600 0.063694
We want to focus on a specific period of time, and to do that we will need to compare dates. Let’s turn our date column entries into date objects so we can compare them to each other easily.
#make date into date object
df <- df %>%
type_convert(col_types = cols(date = col_date(format = "%Y-%m-%d")))
Now monkeys aren’t very smart, so they will only care about the closing price of the stock for a given day. Let’s get rid of the rest of the columns for now.
#select date and close for each stock
df <- select(df,date,aapl_close,cone_close,fb_close,goog_close,govt_close,vz_close)
head(df)
## date aapl_close cone_close fb_close goog_close govt_close vz_close
## 1 2014-05-12 77.6204 18.4400 59.8300 529.92 22.8782 38.3433
## 2 2014-05-13 77.7421 18.5244 59.8300 533.09 22.9059 38.0965
## 3 2014-05-14 77.7565 18.4907 59.2300 526.65 23.0077 38.2239
## 4 2014-05-15 77.0953 18.4653 57.9190 519.98 22.9893 38.1841
## 5 2014-05-16 78.2331 18.6512 58.0199 520.63 23.0051 39.0678
## 6 2014-05-19 79.1601 18.5413 59.2100 528.86 22.9523 39.1315
We now have a dataframe that is ready for the monkey side of the experiment.
The monkeys will buy a new stock every week (5 business days). We can pick every 5th entry using the slice command. Also, we will be testing the monkeys over a specific period, so let’s also filter out any rows that don’t fall between our test period, from 1/03/17 to 05/10/2019.
#keep data for testing period, 1/03/2017 - 05/10/2019
df <- filter(df, as.POSIXct(date) >= as.POSIXct("2017-01-03") &
as.POSIXct(date) <= "2019-05-10")
#take every 5th value to get one per week
df <- slice(df, seq(2, nrow(df), by=5))
We use the function as.POSIXct as a way to easily compare dates. Now give our monkeys some money and see how they do. First, create an empty column which will be the monkeysportfolio, and start them off with a small loan of a million dollars.
#create empty column which will be for portfolio
df$portfolio_val <- NA #blank column
df$portfolio_val[1] <- 1000000 #start off with $1,000,000
We are ready for the monkeys to make their stock picks. The monkeys will operate in the following fashion:
Here is that process in action:
for(i in 2:nrow(df)){
#choose a random stock
pick = floor(runif(1, min=2, max=8))
prev_close = ifelse(pick==2, df$aapl_close[i-1],
ifelse(pick==3, df$cone_close[i-1],
ifelse(pick==4, df$fb_close[i-1],
ifelse(pick==5, df$goog_close[i-1],
ifelse(pick==6, df$govt_close[i-1],
ifelse(pick==7, df$vz_close[i-1], 0))))))
num_shares = df$portfolio_val[i-1]/prev_close
curr_close = ifelse(pick==2, df$aapl_close[i],
ifelse(pick==3, df$cone_close[i],
ifelse(pick==4, df$fb_close[i],
ifelse(pick==5, df$goog_close[i],
ifelse(pick==6, df$govt_close[i],
ifelse(pick==7, df$vz_close[i], 0))))))
df$portfolio_val[i] = num_shares*curr_close
}
#The long if statements are necessary due to the random picks and some quirks
#of the r language. This will work for our small experiment, however this
#workaround would not scale well in a practical version.
We now have our monkey stock performance. We can calculate how well the monkeys did but looking at their net portfolio change for each week, and plot this over time to get an overall sense of their performance.
Note: This data is dynamically generated from our JSON call. Thus, it is execution dependent. Recall that we recorded the performance of our monkey portfolio in a single run by CSV file in order to allow stable comparison with our human portfolio.
monkey_df <- df %>%
select(date, portfolio_val) %>%
mutate(net_portfolio_change = portfolio_val - 1000000)
#plot monkey returns
monkey_df %>%
ggplot(mapping=aes(x=date, y=portfolio_val)) + geom_line()+
labs(title="Portfolio value over test period",
x = "Date",
y = "Portfolio Value (Dollars)") +
theme(plot.title = element_text(hjust = 0.5))
Note: In practice, this was done once so as to minimize the tutorial’s dependence on the time of JSON call.
write_csv(monkey_df, "monkey_df.csv")
Before we can start constructing our portfolio, we need to understand what we’re trying to achieve! To this end, we need to understand the financial theory being applied. Much of this material is adapted from the following, wonderful text:
For the sake of brevity, we provide a high-level overview. If this is still not sufficient, consult the textbook above.
Fundamentally, we wish to train a regression model on training data. This model is then applied to unseen test data in order to forecast the daily returns of equities. This model is then used for investment decisions by, in our case, investing all of our capital in stocks with the greatest expected return.
We use the following regression model:
\[ E(R_i) = R_f + \sum_{k=1}^N \beta_{ik} \big( E(F_k) - R_f \big) + \varepsilon, \] where \(R_i\) is the return of the \(i\)th asset, \(E(F_k)\) is the expected value of the \(k\)th factor, \(R_f\) is the risk-free rate and \(\varepsilon\) is the error term. But why do we use this model? What do these terms mean? The answer lies in Arbitrage Pricing Theory
According to general financial theory, a portfolio is fundamentally composed of risky and riskless assets. An asset is said to be riskless if it has a constant rate of return, the risk-free rate \(R_f\). Conversely, the return of a risky asset is variable.
Of course, no asset is truly riskless. Thus, portfolio managers supplant a risk-free asset with one of little risk, such as 10 year U.S. Treasury notes (T-Bill).
In our human portfolio, we take the iShares Core U.S. Treasury Bond ETF as the riskless asset. Unlike T-Bills, we need not wait until maturation to realize our return, thereby allowing us to define the risk-free rate for shorter time-intervals, namely our weekly holding period.
Risk belongs to one of two categories:
Risk is said to be diversifiable if the strategic allocation of a portfolio into certain financial instruments can lower the level of risk. Thus, systematic risk can be viewed as risk inherent to the market that cannot be eliminated. Unsystematic risk can be viewed as risk inherent to specific assets that can be lowered by adjusting our portfolio’s exposure to said asset.
Arbitrage Pricing Theory (APT) assumes the following Theorem:
For a fixed time, the transaction cost for an asset is fixed.
This implies that the existence of arbitrage (a price differential that thereby allows for risk-less profit) is fleeting and will be eliminated by the market. Thus, risky asset cannot be rendered risk-less. Furthermore, ABT assumes that investors are compensated with superior returns, defined as returns in excess of the risk-free rate, for taking risk that cannot be diversified away: systematic risk. Otherwise, were investors compensated for risk that can be diversified, we would have arbitrage.
Let’s return to our model to bring everything together
Recall the model: \[ E(R_i) = R_f + \sum_{k=1}^N \beta_{ik} \big( E(F_k) - R_f \big) + \varepsilon, \] With our knowledge of ABT, we now see that
Thus, it is well-defined to use the same risk-factors for all of the assets being modeled since they are systematic. But, what are the factors themselves?
APT merely provides a theoretical framework in support of the application of a certain multifactor model. It doesn’t, however, provide the factors. Therefore, we consult the BARRA risk-model handbook for factors. All the factors in the handbook have empirical basis. Thus, we choose factors that are not only feasible to implement, but also draw from many potential sources of systematic risk.
The factors used in our factor model, categorized by area of systematic risk.
For volatility risk, we use HILO, the ratio of the high to low over the past month: \[ HILO=\log \left( \frac{P_H}{P_L} \right) \]
For momentum risk, we use RSTR, the relative strength: \[ RSTR=\sum_{i=1}^{12} \log \left( 1 + r_i \right) - \sum_{i=1}^{12} \log \left( 1 + (r_{f})_i \right) \] where \(r_i\) is the return of the asset in the \(i\)th month. Similarly, \((r_{f})_i\) is the return of the risk-less asset in the \(i\)th month.
For size risk, we use LNCAP, the log of the market cap: \[ LNCAP = \log(\textrm{market cap}) \]
For size non-linearity risk, we use LCAPCB, the cube of the low of the market cap: \[ LCAPCB = \big(\log(\textrm{market cap}) \big)^{1/3} \]
For trading activity risk, we use VLVR, the volume to variance: \[ VLVR = \log \left( \frac{\sum_{i=1}^{12} V_i P_i}{\sigma_{res}} \right) \] where \(V_i\) is the total volume for the \(i\)th month, \(P_i\) is the price at the end of the \(i\)th month, and \(\sigma_{res}\) is the residual standard deviation.
The residual standard deviation is defined as the standard deviation of the residual return over the past year.
For earnings yield risk, we use BTOP, the book-to-price ratio: \[ BTOP = \frac{\textrm{annual book}}{\textrm{market cap}} \]
First, we read the daily price data from CSV files.
Note: the CSV data was obtained by making JSON calls to the IEX API, parsing the data, and writing it to a CSV. This was done in order to ensure our experiment does not depend on the time of execution. Furthermore, the free IEX API will be deprecated on June 1, 2019. See Appendix A for the code.
# Read AAPL data
df_aapl <- read_csv("./data/aapl.csv", col_names=
c("date",
"aapl_close",
"aapl_high",
"aapl_low",
"aapl_vol",
"aapl_change"), col_types=
list(col_date(format="%Y-%m-%d"),
col_double(),
col_double(),
col_double(),
col_double(),
col_double()), skip=1)
# Read CONE data
df_cone <- read_csv("./data/cone.csv", col_names=
c("date",
"cone_close",
"cone_high",
"cone_low",
"cone_vol",
"cone_change"), col_types=
list(col_date(format="%Y-%m-%d"),
col_double(),
col_double(),
col_double(),
col_double(),
col_double()), skip=1)
# Read FB data
df_fb <- read_csv("./data/fb.csv", col_names=
c("date",
"fb_close",
"fb_high",
"fb_low",
"fb_vol",
"fb_change"), col_types=
list(col_date(format="%Y-%m-%d"),
col_double(),
col_double(),
col_double(),
col_double(),
col_double()), skip=1)
# Read GOOG data
df_goog <- read_csv("./data/goog.csv", col_names=
c("date",
"goog_close",
"goog_high",
"goog_low",
"goog_vol",
"goog_change"), col_types=
list(col_date(format="%Y-%m-%d"),
col_double(),
col_double(),
col_double(),
col_double(),
col_double()), skip=1)
# Read GOVT data
df_govt <- read_csv("./data/govt.csv", col_names=
c("date",
"govt_close",
"govt_high",
"govt_low",
"govt_vol",
"govt_change"), col_types=
list(col_date(format="%Y-%m-%d"),
col_double(),
col_double(),
col_double(),
col_double(),
col_double()), skip=1)
# Read VZ data
df_vz <- read_csv("./data/vz.csv", col_names=
c("date",
"vz_close",
"vz_high",
"vz_low",
"vz_vol",
"vz_change"), col_types=
list(col_date(format="%Y-%m-%d"),
col_double(),
col_double(),
col_double(),
col_double(),
col_double()), skip=1)
Here is what our data looks like:
The contents of each dataframe, after reading from csv:
head(df_aapl)
## # A tibble: 6 x 6
## date aapl_close aapl_high aapl_low aapl_vol aapl_change
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 77.6 77.7 76.9 53324677 0.954
## 2 2014-05-13 77.7 77.8 77.3 39934594 0.122
## 3 2014-05-14 77.8 78.2 77.5 41600846 0.0144
## 4 2014-05-15 77.1 78.1 77.0 57711731 -0.661
## 5 2014-05-16 78.2 78.2 76.6 69091834 1.14
## 6 2014-05-19 79.2 79.5 78.2 79439024 0.927
head(df_cone)
## # A tibble: 6 x 6
## date cone_close cone_high cone_low cone_vol cone_change
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 18.4 18.7 18.2 130131 0.211
## 2 2014-05-13 18.5 18.7 18.4 157313 0.0845
## 3 2014-05-14 18.5 18.7 18.3 104227 -0.0338
## 4 2014-05-15 18.5 18.7 18.1 167331 -0.0253
## 5 2014-05-16 18.7 18.8 18.2 239721 0.186
## 6 2014-05-19 18.5 18.9 18.5 106110 -0.110
head(df_fb)
## # A tibble: 6 x 6
## date fb_close fb_high fb_low fb_vol fb_change
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 59.8 59.9 58.0 48575487 2.59
## 2 2014-05-13 59.8 60.9 59.5 48525476 0
## 3 2014-05-14 59.2 60.4 59.0 47428583 -0.6
## 4 2014-05-15 57.9 59.4 57.5 56813940 -1.31
## 5 2014-05-16 58.0 58.4 57.3 47933075 0.101
## 6 2014-05-19 59.2 59.6 57.6 43033925 1.19
head(df_goog)
## # A tibble: 6 x 6
## date goog_close goog_high goog_low goog_vol goog_change
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 530. 530. 519. 1908392 11.2
## 2 2014-05-13 533. 536. 530. 1648907 3.17
## 3 2014-05-14 527. 533 525. 1191863 -6.44
## 4 2014-05-15 520. 526. 517. 1703758 -6.67
## 5 2014-05-16 521. 522. 515. 1481688 0.65
## 6 2014-05-19 529. 530. 518. 1276362 8.23
head(df_govt)
## # A tibble: 6 x 6
## date govt_close govt_high govt_low govt_vol govt_change
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 22.9 23.1 22.9 54908 -0.0185
## 2 2014-05-13 22.9 23.0 22.9 26060 0.0278
## 3 2014-05-14 23.0 23.1 22.8 29830 0.102
## 4 2014-05-15 23.0 23.1 23.0 36025 -0.0184
## 5 2014-05-16 23.0 23.0 23.0 52543 0.0157
## 6 2014-05-19 23.0 23.1 23.0 62702 -0.0528
head(df_vz)
## # A tibble: 6 x 6
## date vz_close vz_high vz_low vz_vol vz_change
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 38.3 38.7 38.2 13582378 -0.255
## 2 2014-05-13 38.1 38.5 37.9 18017013 -0.247
## 3 2014-05-14 38.2 38.5 38.1 13875195 0.127
## 4 2014-05-15 38.2 38.4 38.1 13115429 -0.0398
## 5 2014-05-16 39.1 39.3 38.6 29389728 0.884
## 6 2014-05-19 39.1 39.4 39.1 19062600 0.0637
Next, we read the balance sheet data for our equities from CSV files. For detailed information on our data source, consult Appendix B.
# Read bsheet data for AAPL
df_aapl_b <- read_csv("./data/AAPL_bsheet.csv", col_names=
c("date",
"aapl_book",
"aapl_outS"), col_types=
list(col_date(format="%m/%d/%Y"),
col_double(),
col_guess()), skip=1)
# Read bsheet data for CONE
df_cone_b <- read_csv("./data/CONE_bsheet.csv", col_names=
c("date",
"cone_book",
"cone_outS"), col_types=
list(col_date(format="%m/%d/%Y"),
col_guess(),
col_guess()), skip=1)
# Read bsheet data for FB
df_fb_b <- read_csv("./data/FB_bsheet.csv", col_names=
c("date",
"fb_book",
"fb_outS"), col_types=
list(col_date(format="%m/%d/%Y"),
col_guess(),
col_guess()), skip=1)
# Read bsheet data for GOOG
df_goog_b <- read_csv("./data/GOOG_bsheet.csv", col_names=
c("date",
"goog_book",
"goog_outS"), col_types=
list(col_date(format="%m/%d/%Y"),
col_guess(),
col_guess()), skip=1)
# Read bsheet data for VZ
df_vz_b <- read_csv("./data/VZ_bsheet.csv", col_names=
c("date",
"vz_book",
"vz_outS"), col_types=
list(col_date(format="%m/%d/%Y"),
col_guess(),
col_guess()), skip=1)
Here is what our data looks like:
The contents of each dataframe, after reading from csv:
head(df_aapl_b)
## # A tibble: 6 x 3
## date aapl_book aapl_outS
## <date> <dbl> <dbl>
## 1 2014-03-29 120179000000 861745000
## 2 2014-06-28 120179000000 5989171000
## 3 2014-09-27 120179000000 5866161000
## 4 2014-12-27 120179000000 5826419000
## 5 2015-03-28 129000000000 5762278000
## 6 2015-06-30 129000000000 5705400000
head(df_cone_b)
## # A tibble: 6 x 3
## date cone_book cone_outS
## <date> <dbl> <dbl>
## 1 2014-03-31 319600000 22116172
## 2 2014-06-30 319600000 38658249
## 3 2014-09-30 319600000 38653771
## 4 2014-12-31 319600000 38651517
## 5 2015-03-31 460700000 39058786
## 6 2015-06-30 460700000 39058786
head(df_fb_b)
## # A tibble: 6 x 3
## date fb_book fb_outS
## <date> <dbl> <dbl>
## 1 2014-03-31 16737000000 1991000000
## 2 2014-06-30 16737000000 2013000000
## 3 2014-09-30 16737000000 2044000000
## 4 2014-12-31 16737000000 2234113007
## 5 2015-03-31 36096000000 2247000000
## 6 2015-06-30 36096000000 2256000000
head(df_goog_b)
## # A tibble: 6 x 3
## date goog_book goog_outS
## <date> <dbl> <dbl>
## 1 2014-03-31 91700000000 337231000
## 2 2014-06-30 91700000000 337966000
## 3 2014-09-30 91700000000 339282000
## 4 2014-12-31 91700000000 340399000
## 5 2015-03-31 105000000000 341652000
## 6 2015-06-30 105000000000 343908000
head(df_vz_b)
## # A tibble: 6 x 3
## date vz_book vz_outS
## <date> <dbl> <dbl>
## 1 2014-03-31 13851000000 3425000000
## 2 2014-06-30 13851000000 4147000000
## 3 2014-09-30 13851000000 4152000000
## 4 2014-12-31 13851000000 3974000000
## 5 2015-03-31 13676000000 4116000000
## 6 2015-06-30 13676000000 4079000000
Now, we prepare our raw data for the calculation of our factors. The contents of the dataframes are shown.
# Joining our data into a single dateframe and adding the earliest book
# and outstanding shares data from the balance sheet dataframes
# For convenience, we also simplify our attribute names.
# Lastly, we sort by ascending date
# Join AAPL
df_aapl <- df_aapl %>%
full_join(df_aapl_b, by="date") %>%
dplyr::rename("close"=2, "high"=3, "low"=4, "vol"=5, "change"=6, "book"=7,
"outS"=8) %>%
arrange(date)
# Join CONE
df_cone <- df_cone %>%
full_join(df_cone_b, by="date") %>%
dplyr::rename("close"=2, "high"=3, "low"=4, "vol"=5, "change"=6, "book"=7,
"outS"=8) %>%
arrange(date)
# Join FB
df_fb <- df_fb %>%
full_join(df_fb_b, by="date") %>%
dplyr::rename("close"=2, "high"=3, "low"=4, "vol"=5, "change"=6, "book"=7,
"outS"=8) %>%
arrange(date)
# Join GOOG
df_goog <- df_goog %>%
full_join(df_goog_b, by="date") %>%
dplyr::rename("close"=2, "high"=3, "low"=4, "vol"=5, "change"=6, "book"=7,
"outS"=8) %>%
arrange(date)
# Join VZ
df_vz <- df_vz %>%
full_join(df_vz_b, by="date") %>%
dplyr::rename("close"=2, "high"=3, "low"=4, "vol"=5, "change"=6, "book"=7,
"outS"=8) %>%
arrange(date)
# Remove extraneous variables from our workspace
rm(df_aapl_b, df_cone_b, df_fb_b, df_goog_b, df_vz_b)
# Remove all but the change and date for df_govt since we only care about
# its change
df_govt <- df_govt %>%
select(date, change="govt_change", close="govt_close")
# Replace NA values in the "Book" and "Outstanding Shares" attributes
# with last non-NA value
df_aapl$book <- na.locf(df_aapl$book)
df_aapl$outS <- na.locf(df_aapl$outS)
df_cone$book <- na.locf(df_cone$book)
df_cone$outS <- na.locf(df_cone$outS)
df_fb$book <- na.locf(df_fb$book)
df_fb$outS <- na.locf(df_fb$outS)
df_goog$book <- na.locf(df_goog$book)
df_goog$outS <- na.locf(df_goog$outS)
df_vz$book <- na.locf(df_vz$book)
df_vz$outS <- na.locf(df_vz$outS)
# Since earning reports don't always correspond to trading days, remove
# dates that aren't trading days
df_aapl <- filter(df_aapl, !is.na(close))
df_cone <- filter(df_cone, !is.na(close))
df_fb <- filter(df_fb, !is.na(close))
df_goog <- filter(df_goog, !is.na(close))
df_vz <- filter(df_vz, !is.na(close))
# Calculate market portfolio of returns and date
df_market <- df_aapl %>%
select(date, cl_aapl=close, ch_aapl=change) %>%
right_join(df_cone, by="date") %>%
select(date, cl_aapl, ch_aapl,
cl_cone=close, ch_cone=change) %>%
right_join(df_fb, by="date") %>%
select(date, cl_aapl, ch_aapl,
cl_cone, ch_cone,
cl_fb=close, ch_fb=change) %>%
right_join(df_goog, by="date") %>%
select(date, cl_aapl, ch_aapl,
cl_cone, ch_cone,
cl_fb, ch_fb,
cl_goog=close, ch_goog=change) %>%
right_join(df_vz, by="date") %>%
select(date, cl_aapl, ch_aapl,
cl_cone, ch_cone,
cl_fb, ch_fb,
cl_goog, ch_goog,
cl_vz=close, ch_vz=change)
df_market <- df_market %>%
mutate(cl_market=cl_aapl+cl_cone+cl_fb+cl_goog+cl_vz) %>%
mutate(w_aapl=cl_aapl/cl_market,
w_cone=cl_cone/cl_market,
w_fb=cl_fb/cl_market,
w_goog=cl_goog/cl_market,
w_vz=cl_vz/cl_market) %>%
mutate(ch_market=w_aapl*ch_aapl+
w_cone*ch_cone+
w_fb*ch_fb+
w_goog*ch_goog+
w_vz*ch_vz) %>%
mutate(r_market=ch_market/lag(cl_market, 1)) %>%
select(date, r_market)
# Calculate the daily returns on our risk-free rate (govt)
df_govt <- df_govt %>%
mutate(r=change/lag(close, 1))
# Calculate the return and excess return for each asset,
# including the market portfolio
get_returns <- function(df) {
df <- df %>%
mutate(r=change/lag(close, 1)) %>%
mutate(r_excess = r-df_govt$r)
}
df_aapl <- get_returns(df_aapl)
df_cone <- get_returns(df_cone)
df_fb <- get_returns(df_fb)
df_goog <- get_returns(df_goog)
df_vz <- get_returns(df_vz)
df_market <- df_market %>%
mutate(r_excess_market = r_market-df_govt$r)
The contents of each dataframe, after combining business sheet and daily data.
Our AAPL dataframe:
head(df_aapl)
## # A tibble: 6 x 10
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 77.6 77.7 76.9 5.33e7 0.954 1.20e11 8.62e8 NA
## 2 2014-05-13 77.7 77.8 77.3 3.99e7 0.122 1.20e11 8.62e8 1.57e-3
## 3 2014-05-14 77.8 78.2 77.5 4.16e7 0.0144 1.20e11 8.62e8 1.85e-4
## 4 2014-05-15 77.1 78.1 77.0 5.77e7 -0.661 1.20e11 8.62e8 -8.50e-3
## 5 2014-05-16 78.2 78.2 76.6 6.91e7 1.14 1.20e11 8.62e8 1.48e-2
## 6 2014-05-19 79.2 79.5 78.2 7.94e7 0.927 1.20e11 8.62e8 1.18e-2
## # ... with 1 more variable: r_excess <dbl>
Our CONE dataframe:
head(df_cone)
## # A tibble: 6 x 10
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 18.4 18.7 18.2 130131 0.211 3.20e8 2.21e7 NA
## 2 2014-05-13 18.5 18.7 18.4 157313 0.0845 3.20e8 2.21e7 0.00458
## 3 2014-05-14 18.5 18.7 18.3 104227 -0.0338 3.20e8 2.21e7 -0.00182
## 4 2014-05-15 18.5 18.7 18.1 167331 -0.0253 3.20e8 2.21e7 -0.00137
## 5 2014-05-16 18.7 18.8 18.2 239721 0.186 3.20e8 2.21e7 0.0101
## 6 2014-05-19 18.5 18.9 18.5 106110 -0.110 3.20e8 2.21e7 -0.00589
## # ... with 1 more variable: r_excess <dbl>
Our FB dataframe:
head(df_fb)
## # A tibble: 6 x 10
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 59.8 59.9 58.0 4.86e7 2.59 1.67e10 1.99e9 NA
## 2 2014-05-13 59.8 60.9 59.5 4.85e7 0 1.67e10 1.99e9 0
## 3 2014-05-14 59.2 60.4 59.0 4.74e7 -0.6 1.67e10 1.99e9 -0.0100
## 4 2014-05-15 57.9 59.4 57.5 5.68e7 -1.31 1.67e10 1.99e9 -0.0221
## 5 2014-05-16 58.0 58.4 57.3 4.79e7 0.101 1.67e10 1.99e9 0.00174
## 6 2014-05-19 59.2 59.6 57.6 4.30e7 1.19 1.67e10 1.99e9 0.0205
## # ... with 1 more variable: r_excess <dbl>
Our GOOG dataframe:
head(df_goog)
## # A tibble: 6 x 10
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 530. 530. 519. 1.91e6 11.2 9.17e10 3.37e8 NA
## 2 2014-05-13 533. 536. 530. 1.65e6 3.17 9.17e10 3.37e8 0.00598
## 3 2014-05-14 527. 533 525. 1.19e6 -6.44 9.17e10 3.37e8 -0.0121
## 4 2014-05-15 520. 526. 517. 1.70e6 -6.67 9.17e10 3.37e8 -0.0127
## 5 2014-05-16 521. 522. 515. 1.48e6 0.65 9.17e10 3.37e8 0.00125
## 6 2014-05-19 529. 530. 518. 1.28e6 8.23 9.17e10 3.37e8 0.0158
## # ... with 1 more variable: r_excess <dbl>
Our VZ dataframe:
head(df_vz)
## # A tibble: 6 x 10
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-12 38.3 38.7 38.2 1.36e7 -0.255 1.39e10 3.42e9 NA
## 2 2014-05-13 38.1 38.5 37.9 1.80e7 -0.247 1.39e10 3.42e9 -0.00644
## 3 2014-05-14 38.2 38.5 38.1 1.39e7 0.127 1.39e10 3.42e9 0.00334
## 4 2014-05-15 38.2 38.4 38.1 1.31e7 -0.0398 1.39e10 3.42e9 -0.00104
## 5 2014-05-16 39.1 39.3 38.6 2.94e7 0.884 1.39e10 3.42e9 0.0231
## 6 2014-05-19 39.1 39.4 39.1 1.91e7 0.0637 1.39e10 3.42e9 0.00163
## # ... with 1 more variable: r_excess <dbl>
Now we calculate the factors for each asset.
# Calculate VLVR
# Function that returns a dataframe with the VLVR
get_vlvr <- function(df) {
df <- df %>%
right_join(df_market, by="date") %>%
select(-r_market)
# Run the regression for the beta
df_fit <- lm(r_excess~r_excess_market,
filter(df, date<"2017-01-03"))
# Fit residual return
beta <- ((broom::tidy(df_fit))["estimate"])[[1, 1]]
df <- df %>%
mutate(r_resid=r_excess-(beta*r_excess_market)) %>%
select(-r_excess_market)
# Calculate annual standard deviation of residual returns
# Calculate rolling annual volume and monthly close
# Note that we use a window of 252+1
# Calculate VLVR
df <- df %>%
mutate(r_resid_sd=roll_sdr(r_resid, n=253, fill=NA)) %>%
mutate(VLVR=log10((roll_sumr(vol, 21)*lag(close, 21)+
roll_sumr(lag(vol, 21), 21)*lag(close, 21*2)+
roll_sumr(lag(vol, 21*2), 21)*lag(close, 21*3)+
roll_sumr(lag(vol, 21*3), 21)*lag(close, 21*4)+
roll_sumr(lag(vol, 21*4), 21)*lag(close, 21*5)+
roll_sumr(lag(vol, 21*5), 21)*lag(close, 21*6)+
roll_sumr(lag(vol, 21*6), 21)*lag(close, 21*7)+
roll_sumr(lag(vol, 21*7), 21)*lag(close, 21*8)+
roll_sumr(lag(vol, 21*8), 21)*lag(close, 21*9)+
roll_sumr(lag(vol, 21*9), 21)*lag(close, 21*10)+
roll_sumr(lag(vol, 21*10), 21)*lag(close, 21*11)+
roll_sumr(lag(vol, 21*11), 21)*lag(close, 21*12))/r_resid_sd)) %>%
select(-r_resid, -r_resid_sd)
}
df_aapl <- get_vlvr(df_aapl)
df_cone <- get_vlvr(df_cone)
df_fb <- get_vlvr(df_fb)
df_goog <- get_vlvr(df_goog)
df_vz <- get_vlvr(df_vz)
# Function that returns a dataframe with the HILO
get_hilo <- function(df) {
# Calculate lowest and highest price for past trading month for every
# trading day
df$m_high <- df %>%
select(high) %>%
rollmax(k = 22, na.pad = TRUE, align = "right")
df$m_low <- df %>%
select(low) %>%
mutate(low=-low) %>%
rollmax(k = 22, na.pad = TRUE, align = "right") * -1
# Calculate the HILO
df <- df %>%
mutate(HILO = log10(m_high/m_low)) %>%
select(-m_high, -m_low)
}
df_aapl <- get_hilo(df_aapl)
df_cone <- get_hilo(df_cone)
df_fb <- get_hilo(df_fb)
df_goog <- get_hilo(df_goog)
df_vz <- get_hilo(df_vz)
# Function that returns a dataframe with the RSTR
get_rstr <- function(df) {
df <- df %>%
mutate(RSTR_l=
log10(1+(roll_sumr(change, 21)/lag(close, 21)))+
log10(1+(roll_sumr(lag(change, 21), 21)/lag(close, 21*2)))+
log10(1+(roll_sumr(lag(change, 21), 21*2)/lag(close, 21*3)))+
log10(1+(roll_sumr(lag(change, 21), 21*3)/lag(close, 21*4)))+
log10(1+(roll_sumr(lag(change, 21), 21*4)/lag(close, 21*5)))+
log10(1+(roll_sumr(lag(change, 21), 21*5)/lag(close, 21*6)))+
log10(1+(roll_sumr(lag(change, 21), 21*6)/lag(close, 21*7)))+
log10(1+(roll_sumr(lag(change, 21), 21*7)/lag(close, 21*8)))+
log10(1+(roll_sumr(lag(change, 21), 21*8)/lag(close, 21*9)))+
log10(1+(roll_sumr(lag(change, 21), 21*9)/lag(close, 21*10)))+
log10(1+(roll_sumr(lag(change, 21), 21*10)/lag(close, 21*11)))+
log10(1+(roll_sumr(lag(change, 21), 21*11)/lag(close, 21*12)))) %>%
mutate(RSTR_r=
log10(1+(roll_sumr(df_govt$change, 21)/lag(df_govt$close, 21)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21)/lag(df_govt$close, 21*2)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*2)/lag(df_govt$close, 21*3)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*3)/lag(df_govt$close, 21*4)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*4)/lag(df_govt$close, 21*5)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*5)/lag(df_govt$close, 21*6)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*6)/lag(df_govt$close, 21*7)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*7)/lag(df_govt$close, 21*8)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*8)/lag(df_govt$close, 21*9)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*9)/lag(df_govt$close, 21*10)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*10)/lag(df_govt$close, 21*11)))+
log10(1+(roll_sumr(lag(df_govt$change, 21), 21*11)/lag(df_govt$close, 21*12)))) %>%
mutate(RSTR=RSTR_l-RSTR_r) %>%
select(-RSTR_l, -RSTR_r)
}
df_aapl <- get_rstr(df_aapl)
df_cone <- get_rstr(df_cone)
df_fb <- get_rstr(df_fb)
df_goog <- get_rstr(df_goog)
df_vz <- get_rstr(df_vz)
# Function that returns a dataframe with the LNCAP
get_lncap <- function(df) {
df <- df %>%
mutate(LNCAP=log10(outS*lag(close, 1)))
}
df_aapl <- get_lncap(df_aapl)
df_cone <- get_lncap(df_cone)
df_fb <- get_lncap(df_fb)
df_goog <- get_lncap(df_goog)
df_vz <- get_lncap(df_vz)
# Function that returns a dataframe with the LCAPCB
get_lcapcb <- function(df) {
df <- df %>%
mutate(LCAPCB=(log10(outS*lag(close, 1)))^(1/3))
}
df_aapl <- get_lcapcb(df_aapl)
df_cone <- get_lcapcb(df_cone)
df_fb <- get_lcapcb(df_fb)
df_goog <- get_lcapcb(df_goog)
df_vz <- get_lcapcb(df_vz)
# Function that returns a dataframe with the BTOP
get_btop <- function(df) {
df <- df %>%
mutate(BTOP=book/(outS*lag(close, 1)))
}
df_aapl <- get_btop(df_aapl)
df_cone <- get_btop(df_cone)
df_fb <- get_btop(df_fb)
df_goog <- get_btop(df_goog)
df_vz <- get_btop(df_vz)
Here, we tidy our data up and make necessary adjustments (taking the residual factor values for example). We also separate our data into training and testing sets.
The datasets are shown after this process.
# Approximate Risk-free rate for the day
r_f <- mean((filter(df_govt, date<"2017-01-03"))$r, na.rm=TRUE)
# Trim data and make amenable to regression (get residual factor values)
trim_dat <- function(df) {
df <- df %>%
filter(!is.na(VLVR)) %>%
mutate(VLVR=VLVR-r_f,
HILO=HILO-r_f,
RSTR=RSTR-r_f,
LNCAP=LNCAP-r_f,
LCAPCB=LCAPCB-r_f,
BTOP=BTOP-r_f)
}
df_aapl <- trim_dat(df_aapl)
df_cone <- trim_dat(df_cone)
df_fb <- trim_dat(df_fb)
df_goog <- trim_dat(df_goog)
df_vz <- trim_dat(df_vz)
# Create our training and test sets
train_aapl <- filter(df_aapl, date<"2017-01-03")
test_aapl <- filter(df_aapl, date>="2017-01-03")
train_cone <- filter(df_cone, date<"2017-01-03")
test_cone <- filter(df_cone, date>="2017-01-03")
train_fb <- filter(df_fb, date<"2017-01-03")
test_fb <- filter(df_fb, date>="2017-01-03")
train_goog <- filter(df_goog, date<"2017-01-03")
test_goog <- filter(df_goog, date>="2017-01-03")
train_vz <- filter(df_vz, date<"2017-01-03")
test_vz <- filter(df_vz, date>="2017-01-03")
The contents of each dataframe, which is ready for regression.
Our AAPL dataframe:
head(df_aapl)
## # A tibble: 6 x 16
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015-05-13 118. 119. 117. 3.47e7 0.135 1.29e11 5.76e9 1.15e-3
## 2 2015-05-14 120. 120. 119. 4.52e7 2.74 1.29e11 5.76e9 2.33e-2
## 3 2015-05-15 120. 121. 120. 3.82e7 -0.168 1.29e11 5.76e9 -1.40e-3
## 4 2015-05-18 121. 122. 120. 5.09e7 1.32 1.29e11 5.76e9 1.10e-2
## 5 2015-05-19 121. 122. 121. 4.46e7 -0.112 1.29e11 5.76e9 -9.22e-4
## 6 2015-05-20 121. 122. 121. 3.65e7 -0.00933 1.29e11 5.76e9 -7.69e-5
## # ... with 7 more variables: r_excess <dbl>, VLVR <dbl>, HILO[,1] <dbl>,
## # RSTR <dbl>, LNCAP <dbl>, LCAPCB <dbl>, BTOP <dbl>
Our CONE dataframe:
head(df_cone)
## # A tibble: 6 x 16
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015-05-13 27.0 27.3 26.9 642730 -0.0175 4.61e8 3.91e7 -6.47e-4
## 2 2015-05-14 26.9 27.3 26.7 566434 -0.122 4.61e8 3.91e7 -4.53e-3
## 3 2015-05-15 27.4 27.4 26.9 461619 0.481 4.61e8 3.91e7 1.79e-2
## 4 2015-05-18 27.7 27.7 27.1 397508 0.332 4.61e8 3.91e7 1.21e-2
## 5 2015-05-19 28.0 28.1 27.5 520550 0.289 4.61e8 3.91e7 1.04e-2
## 6 2015-05-20 28.0 28.3 27.8 249690 0.0175 4.61e8 3.91e7 6.25e-4
## # ... with 7 more variables: r_excess <dbl>, VLVR <dbl>, HILO[,1] <dbl>,
## # RSTR <dbl>, LNCAP <dbl>, LCAPCB <dbl>, BTOP <dbl>
Our FB dataframe:
head(df_fb)
## # A tibble: 6 x 16
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015-05-13 78.4 78.5 77.6 2.15e7 0.98 3.61e10 2.25e9 1.27e-2
## 2 2015-05-14 81.4 81.8 78.7 4.94e7 2.93 3.61e10 2.25e9 3.74e-2
## 3 2015-05-15 80.4 81.5 80.2 2.71e7 -0.95 3.61e10 2.25e9 -1.17e-2
## 4 2015-05-18 80.9 81.4 80.2 2.16e7 0.46 3.61e10 2.25e9 5.72e-3
## 5 2015-05-19 80.6 81.7 80.6 1.80e7 -0.25 3.61e10 2.25e9 -3.09e-3
## 6 2015-05-20 80.6 81.1 79.5 2.31e7 -0.08 3.61e10 2.25e9 -9.92e-4
## # ... with 7 more variables: r_excess <dbl>, VLVR <dbl>, HILO[,1] <dbl>,
## # RSTR <dbl>, LNCAP <dbl>, LCAPCB <dbl>, BTOP <dbl>
Our GOOG dataframe:
head(df_goog)
## # A tibble: 6 x 16
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015-05-13 530. 534. 529. 1.25e6 0.580 1.05e11 3.42e8 0.00110
## 2 2015-05-14 538. 539 532. 1.40e6 8.78 1.05e11 3.42e8 0.0166
## 3 2015-05-15 534. 539. 530. 1.97e6 -4.55 1.05e11 3.42e8 -0.00845
## 4 2015-05-18 532. 535. 529. 2.00e6 -1.55 1.05e11 3.42e8 -0.00290
## 5 2015-05-19 537. 541. 533. 1.97e6 5.06 1.05e11 3.42e8 0.00951
## 6 2015-05-20 539. 543. 533. 1.43e6 1.91 1.05e11 3.42e8 0.00355
## # ... with 7 more variables: r_excess <dbl>, VLVR <dbl>, HILO[,1] <dbl>,
## # RSTR <dbl>, LNCAP <dbl>, LCAPCB <dbl>, BTOP <dbl>
Our VZ dataframe:
head(df_vz)
## # A tibble: 6 x 16
## date close high low vol change book outS r
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015-05-13 41.4 41.6 41.2 1.62e7 0.0916 1.37e10 4.12e9 0.00222
## 2 2015-05-14 41.6 41.8 41.5 1.21e7 0.200 1.37e10 4.12e9 0.00483
## 3 2015-05-15 41.5 41.7 41.3 1.41e7 -0.150 1.37e10 4.12e9 -0.00360
## 4 2015-05-18 41.3 41.5 41.3 1.50e7 -0.158 1.37e10 4.12e9 -0.00382
## 5 2015-05-19 41.3 41.3 41.1 1.57e7 -0.0416 1.37e10 4.12e9 -0.00101
## 6 2015-05-20 41.4 41.6 41.3 1.36e7 0.117 1.37e10 4.12e9 0.00283
## # ... with 7 more variables: r_excess <dbl>, VLVR <dbl>, HILO[,1] <dbl>,
## # RSTR <dbl>, LNCAP <dbl>, LCAPCB <dbl>, BTOP <dbl>
We add the forecasted returns to our dataframes.
# Fit the results of our regression on the training data onto
# the test dataset
fit_dat <- function(df) {
md <- broom::tidy(lm(r~VLVR+HILO+RSTR+LNCAP+LCAPCB+BTOP, data=train_fb))
df$r_pred <- md[[1,2]]+md[[2,2]]*df$VLVR+
md[[3,2]]*df$HILO+md[[4,2]]*df$RSTR+
md[[5,2]]*df$LNCAP+md[[6,2]]*df$LCAPCB+
md[[7,2]]*df$BTOP
df
}
test_aapl <- fit_dat(test_aapl)
test_cone <- fit_dat(test_cone)
test_fb <- fit_dat(test_fb)
test_goog <- fit_dat(test_goog)
test_vz <- fit_dat(test_vz)
# Remove extraneous variables
rm(train_aapl, train_cone, train_fb, train_goog, train_vz,
df_aapl, df_cone, df_fb, df_goog, df_govt, df_vz)
# Prune data
test_aapl <- test_aapl %>%
select(date, "aapl_c"=close, "aapl_r"=r, "aapl_p"=r_pred)
test_cone <- test_cone %>%
select(date, "cone_c"=close, "cone_r"=r, "cone_p"=r_pred)
test_fb <- test_fb %>%
select(date, "fb_c"=close, "fb_r"=r, "fb_p"=r_pred)
test_goog <- test_goog %>%
select(date, "goog_c"=close, "goog_r"=r, "goog_p"=r_pred)
test_vz <- test_vz %>%
select(date, "vz_c"=close, "vz_r"=r, "vz_p"=r_pred)
Finally, we create and test our human portfolio on the test sample.
# Create portfolio
pfolio <- test_aapl %>%
right_join(test_cone, by="date") %>%
right_join(test_fb, by="date") %>%
right_join(test_goog, by="date") %>%
right_join(test_vz, by="date") %>%
slice(seq(1, nrow(test_cone), by=5))
pfolio$val <- NA
pfolio$val[1] <- 1000000
for (i in seq(1, nrow(pfolio)-1)) {
max_pred=max(pfolio[[i,4]],
pfolio[[i,7]],
pfolio[[i,10]],
pfolio[[i,13]],
pfolio[[i,16]])
# AAPL is the max prediction
if (max_pred==pfolio[[i,4]]) {
pfolio$val[i+1]=(pfolio$val[i]/pfolio[[i,2]])*pfolio[[i+1,2]]
} else if (max_pred==pfolio[[i,7]]) {
pfolio$val[i+1]=(pfolio$val[i]/pfolio[[i,5]])*pfolio[[i+1,5]]
} else if (max_pred==pfolio[[i,10]]) {
pfolio$val[i+1]=(pfolio$val[i]/pfolio[[i,8]])*pfolio[[i+1,8]]
} else if (max_pred==pfolio[[i,13]]) {
pfolio$val[i+1]=(pfolio$val[i]/pfolio[[i,11]])*pfolio[[i+1,11]]
} else if (max_pred==pfolio[[i,16]]) {
pfolio$val[i+1]=(pfolio$val[i]/pfolio[[i,14]])*pfolio[[i+1,14]]
}
}
It’s clear from the graph that the human portfolio outperforms the monkey portfolio. Having both stated with 1,000,000 dollars, both portfolios earned a profit, but the human portfolio earned the most.
# Read the performance of our monkey data
pfolio_monkey <- read_csv("./data/monkey_df.csv", col_names=
c("date",
"val",
"net_change"), col_types=
list(col_date(format="%Y-%m-%d"),
col_double(),
col_double()), skip=1)
# Join data
pfolio_monkey <- mutate(pfolio_monkey, p_type="Monkey")
pfolio <- mutate(pfolio, p_type="Human")
p <- full_join(pfolio, pfolio_monkey)
## Joining, by = c("date", "val", "p_type")
p$p_type <- factor(p$p_type)
p <- p %>%
arrange(date) %>%
group_by(p_type)
# Plot the data
ggplot(data=p, aes(x=date, y=val, color=p_type)) +
geom_line() +
labs(title="Portfolio value over test period",
x = "Date",
y = "Portfolio Value (Dollars)") +
theme(plot.title = element_text(hjust = 0.5))
Our human portfolio returned the following amount after the testing period:
pfolio[[nrow(pfolio),17]]
## [1] 1401404
The return for the human portfolio was
(pfolio[[nrow(pfolio),17]]-1000000)/1000000
## [1] 0.4014044
Our monkey portfolio returned the following amount after the testing period:
pfolio_monkey[[nrow(pfolio_monkey),2]]
## [1] 1200470
The return for the monkey portfolio was
(pfolio_monkey[[nrow(pfolio_monkey),2]]-1000000)/1000000
## [1] 0.2004698
Our tutorial confirms our initial hypothesis that humans outperform the theoretical monkeys in the arena of asset management. It does not, however, indicate that our particular strategy is ready for deployment in the real world. In application, our strategy’s returns would be diminished with the consideration of trading costs, be they liquidity demands or brokerage fees. Furthermore, we simplified many complications, which, in the real world, are unavoidable. For example, we can’t own fractions of a share, like in our program!
Furthermore, the theory of Arbitrage Pricing Theory neglects salient statistical metrics that are commonly used in data science to assess the applicability of a model, namely \(p\)-value assessment. Considering the overall design of our experiment, we could have returned more significant results by considering multiple monkey portfolios. As it stands, it is not sufficient to merely use one instance of a monkey portfolio. This decision, however, was made out of time constraints.
In conclusion, our tutorial confirms our hypothesis. That being said, our model is far from industry-ready. It can, however, be improved by combining concepts in data science and financial theory.
The following code was run once to extract the following daily historical data over the past 5 years from the date of the JSON call:
This data was extracted for the following equities, tradeable on the IEX exchange:
##
# Returns a dataframe of date, close, high, low, and volume
#
# @param: tick
# String, the stock ticker whose data we want
# @param: range
# String, the range of data. We will use "5y"
#
# @author: Jared Gill
iex_chart <- function(tick, range){
# Make REST call
r <- GET("https://api.iextrading.com/1.0/stock/market/batch",
query = list(symbols = tick,
types = "chart",
range = range, last = "5")
)
# Remove formatting
r1 <- content(r)
rtick <- enframe(unlist(r1[[tick]]$chart))
# Extract data
rdates <- filter(rtick, name=="date")
rclose <- filter(rtick, name=="close")
rhigh <- filter(rtick, name=="high")
rlow <- filter(rtick, name=="low")
rvolume <- filter(rtick, name=="volume")
rchange <- filter(rtick, name=="change")
# Make data frame
df <- data.frame(rdates[2], rclose[2], rhigh[2], rlow[2], rvolume[2],
rchange[2])
# Rename columns
df <- df %>%
rename(date="value", close="value.1", high="value.2", low="value.3",
volume="value.4", change="value.5")
return(df)
}
################################################################################
# Main
################################################################################
# Convert data to dataframes
df_aapl <- iex_chart("AAPL", "5y")
df_cone <- iex_chart("CONE", "5y")
df_fb <- iex_chart("FB", "5y")
df_goog <- iex_chart("GOOG", "5y")
df_govt <- iex_chart("GOVT", "5y")
df_vz <- iex_chart("VZ", "5y")
# Write to CSVs
write_csv(df_aapl, "aapl.csv")
write_csv(df_cone, "cone.csv")
write_csv(df_fb, "fb.csv")
write_csv(df_goog, "goog.csv")
write_csv(df_govt, "govt.csv")
write_csv(df_vz, "vz.csv")
Initially, we attempted to obtain curated balance sheet data directly from the IEX Cloud service. This data, however, was limiting and lacked the accuracy we required.
Thus, the balance sheet data was scrapped by hand from the SEC’s EDGAR Database. This process was not automated due to the uncompromising age of the database’s technology, which prevented even a simple parse of its HTML pages. As consequence, the balance sheet data was scrapped from each company’s 10-K and 10-Q SEC filings (10-K and 10-Q forms respectively are annual and quarterly reports on a company’s corporate standing). The following data was parsed:
This data was parsed for the following equities (click to be redirected to their respective 10-K and 10-Q filings on the EDGAR database):
Asset: Short for “financial asset,” which is a tradeable financial instrument.
Change: The change in an asset’s price from an initial time \(t_i\) to a final time \(t_f\). Defined as \[ P_{t_f} - P_{t_i} \]
Close: The price of an equity at the close of the trading day. The “price” of an equity is merely the dollar amount at which a transaction successfully occurred between seller and buyer. Thus, the close is the last transaction price for an equity on a specific trading day.
Equity: Used interchangeably with “stock.”
High: The highest price an equity achieved during the trading day.
Low: The lowest price an equity achieved during the trading day.
Price: Used interchangeably with “close.”
Return: The return of an asset \(A\) is defined as \[ R = \frac{P_{t_i + \Delta t} - P_{t_i}}{P_{t_i}}, \] where \(P_{t}\) refers to the price of \(A\) at some time \(t\). Thus, \(P_{t_i}\) is the price of \(A\) at some initial time \(t_i\), and \(P_{t_i + \Delta t}\) is the price of \(A\) at some future time, where \(\Delta t\) refers to the elapsed time since \(t_i\).
Risk-free Rate: The return of the theoretical risk-free asset.
Volume: The number of shares traded during a given period of time.