Time series data is widely used in many fields, including finance, economics, and environmental science. In this blog, we will explore why handling time series data well matters and how R gives us the tools to analyze and comprehend it. In this practice, we will work with a subset of analysts’ earnings per share (EPS) forecast data provided by the Institutional Brokers’ Estimate System.
Let’s load the packages, read in the dataset, and glance at the data by viewing the first and the last rows.
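A minimal sketch of this step, assuming the data arrives as a CSV file. The file here is a two-row stand-in written to a temporary path (the name eps_toy and the file itself are hypothetical), so the example runs on its own:

```r
# Write a two-row stand-in file; the real dataset is not reproduced here
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ticker,cname,broker,analyst,forecast,FPEDATS,ANNDATS,ANNTIMS,ACTUAL",
  "AAPL,APPLE,3037,106330,2.815,20180930,20171102,23:04:00,2.9775",
  "AAPL,APPLE,4439,194536,5.16,20210930,20210523,22:44:00,5.61"
), csv_path)

# Read the file and look at the first and last rows
eps_toy <- read.csv(csv_path, stringsAsFactors = FALSE)
print(head(eps_toy, n = 1))
print(tail(eps_toy, n = 1))

# str() shows how each column was read (dates come in as integers here)
str(eps_toy)
```

Checking the column types with str() right after loading is a cheap way to spot dates that were read as plain numbers.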
We have data for two companies, Apple and Advanced Auto.
ticker: A unique identifier assigned to each security. In this blog, we have only two tickers: AAPL for Apple and AAPS for Advanced Auto. Company names are stored as cname.
broker and analyst: The analyst is the person who makes the forecast and works for a sell-side institution (the broker). Brokers and analysts are represented by codes to hide their real names.
forecast: The analyst’s earnings per share (EPS) forecast for the company’s share.
FPEDATS: The Forecast Period End Date: It is the ending date of the fiscal period to which the estimate applies. For the majority of companies, the FPEDATS date is December 31st of that year.
ANNDATS: The Announce date: It is the date on which the analyst first made that particular estimate.
ANNTIMS: The announce time: the time of day at which the estimate was entered into the database.
ACTUAL: The actual EPS value announced by the company.
In order to better understand the data, a good starting point is to print the first row of the dataset and interpret what information it contains.
knitr::kable(head(eps, n=1))
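| ticker | cname | broker | analyst | forecast | FPEDATS  | ANNDATS  | ANNTIMS  | ACTUAL |
|--------|-------|--------|---------|----------|----------|----------|----------|--------|
| AAPL   | APPLE | 3037   | 106330  | 2.815    | 20180930 | 20171102 | 23:04:00 | 2.9775 |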
On 2-Nov-2017 (ANNDATS), analyst 106330 (analyst), working at broker house 3037 (broker), predicts that the EPS for APPLE (cname), with ticker AAPL (ticker) and forecast period ending 30-Sep-2018 (FPEDATS), is $2.815 (forecast). This estimate was entered into the database at 23:04 (ANNTIMS), and APPLE announced an actual EPS of $2.9775 (ACTUAL).
Check for the missing values
The good news is that we have no missing data in our dataset, which means that we can proceed with our analysis without worrying about imputing missing values.
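A minimal sketch of such a check, using a small made-up data frame (eps_toy is hypothetical and deliberately contains one NA so the check has something to find):

```r
# Toy stand-in for the eps data frame; values are illustrative only
eps_toy <- data.frame(
  ticker   = c("AAPL", "AAPL", "AAPS"),
  forecast = c(2.815, NA, 1.20),
  ACTUAL   = c(2.9775, 2.9775, 1.25)
)

# Count missing values in each column
na_per_column <- colSums(is.na(eps_toy))
print(na_per_column)

# TRUE if there is at least one missing value anywhere in the data frame
any_missing <- anyNA(eps_toy)
print(any_missing)
```

On a dataset with no missing values, every count is zero and anyNA() returns FALSE.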
Order the data and declare some variables as date variable
# Order the eps dataset by "broker" and "analyst" in descending order
eps_ordered <- eps[order(-eps$broker, -eps$analyst), ]
# Display the first two rows
print(head(eps_ordered, n = 2))
ticker cname broker analyst forecast FPEDATS ANNDATS ANNTIMS ACTUAL
822 AAPL APPLE 4439 194536 4.00 20210930 20210119 5:32:00 5.61
912 AAPL APPLE 4439 194536 5.16 20210930 20210523 22:44:00 5.61
# Print the difference between the first and second rows in ANNDATS
print(eps_ordered$ANNDATS[2] - eps_ordered$ANNDATS[1])
[1] 404
The first row in eps_ordered has the “1/19/2021” ANNDATS entry, and the second row has “5/23/2021”. Therefore, there are 124 days between the first and second rows. However, if we take the difference in R, we get 404, which is incorrect. This is because the ANNDATS and FPEDATS variables are recorded as integers, not date variables, so R subtracts 20210119 from 20210523 as plain numbers. Additionally, the ANNTIMS variable is currently coded as a character variable, but it should be coded as a time entry.
To convert the integer variables FPEDATS and ANNDATS to date variables and the character variable ANNTIMS to a time variable, we will use the ymd() and hms() functions in the lubridate package. The ymd() function parses values written in year, month, day order into dates. Because ANNTIMS holds only a clock time such as 23:04:00 (it appears as a <Period> column in later output), hms() is the matching parser: it reads the hour, minute, and second components, whereas ymd_hms() would also expect a date part.
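Here is a small sketch of the conversion on the two ANNDATS values printed earlier. Using hms() for the clock-only ANNTIMS value is an assumption on my part, and the variable names are made up for the illustration:

```r
library(lubridate)

# Integer dates as stored in the raw data (the two rows printed earlier)
anndats_int <- c(20210119L, 20210523L)

# Integer subtraction treats the dates as plain numbers
int_diff <- anndats_int[2] - anndats_int[1]
print(int_diff)   # 404

# ymd() parses yyyymmdd values into Date objects
anndats_date <- ymd(anndats_int)
day_diff <- as.numeric(anndats_date[2] - anndats_date[1])
print(day_diff)   # 124

# A clock time has no date part; hms() parses it into a Period (assumption)
anntims <- hms("22:44:00")
print(anntims)
```

Once the columns are real dates, subtraction returns elapsed days rather than a meaningless numeric gap.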
What if I want to calculate the total number of unique broker houses that provide forecasts in a specific year, say 2020? Here’s an example R code that counts the unique “broker” values when the “FPEDATS” variable is in the year 2020:
# Filter eps_ordered dataset to include only rows where FPEDATS is in year 2020
eps_2020 <- eps_ordered[format(eps_ordered$FPEDATS, "%Y") == "2020", ]
# Count number of broker houses in eps_2020 dataset
num_broker <- length(unique(eps_2020$broker))
cat("The total number of brokers in 2020 is", num_broker)
The total number of brokers in 2020 is 41
Above, we first filter the eps_ordered dataset to include only rows where the year in the “FPEDATS” variable is 2020. We do this by converting the “FPEDATS” variable to a character format using the format() function and extracting the year using the %Y format specifier. Also, note that the unique() function in R is used to extract the unique values from a vector or data frame column.
Now, let’s identify the total number of analysts that provide earnings per share (EPS) forecasts for Apple in 2018 and store it as analyst_APPL_2018.
An analyst can provide multiple EPS forecasts during the same calendar year. Hence, we need to count the distinct analysts rather than the rows.
# Filter eps_ordered dataset to include only rows where ticker is AAPL
analyst_APPL_2018 <- eps_ordered %>%
  mutate(year = year(FPEDATS)) %>%
  filter(ticker == 'AAPL' & year == 2018) %>%
  select(analyst) %>%
  n_distinct()
cat("The total number of analysts that provide EPS forecasts for Apple in 2018 is", analyst_APPL_2018)
The total number of analysts that provide EPS forecasts for Apple in 2018 is 49
Above, we extracted the year from FPEDATS, filtered to keep rows for the AAPL ticker and the year 2018, and then calculated the distinct number of analysts. The n_distinct() function from the dplyr package counts the number of unique analysts in the resulting data frame. If we wanted to calculate the total number of broker houses that provide earnings per share (EPS) forecasts for Apple in 2018, we would just need to replace analyst with broker in the code above.
This time, I want to identify the broker house (broker) with the highest number of analysts that provide forecasts for Advanced Auto (AAPS).
# Group the data by broker and count the number of unique analysts for each broker
analyst_by_broker <- eps_ordered %>%
  mutate(year = year(FPEDATS)) %>%
  filter(ticker == 'AAPS') %>%
  group_by(broker) %>%
  summarise(num_analysts = n_distinct(analyst))
# Find the broker with the highest number of unique analysts
top_broker <- analyst_by_broker %>%
  filter(num_analysts == max(num_analysts))
cat("The broker house with the highest number of analysts that provide forecasts for AAPS is", top_broker$broker)
The broker house with the highest number of analysts that provide forecasts for AAPS is 18
Now, I want to identify which broker house (broker) has the largest number of analysts providing forecasts for Apple (AAPL) in the fiscal year ending in 2021.
# Group the data by broker and count the number of unique analysts for each broker in year 2021
APPLEanalyst_by_broker <- eps_ordered %>%
  mutate(year = year(FPEDATS)) %>%
  filter(ticker == 'AAPL' & year == 2021) %>%
  group_by(broker) %>%
  summarise(num_analysts = n_distinct(analyst))
# Find the broker(s) with the highest number of unique analysts
top_broker2021 <- APPLEanalyst_by_broker %>%
  filter(num_analysts == max(num_analysts))
cat("The broker house(s) with the highest number of analysts that provide forecasts for AAPL in 2021 is", top_broker2021$broker)
The broker house(s) with the highest number of analysts that provide forecasts for AAPL in 2021 is 10 13 39
Drop observations: Latest forecast
An analyst may well make multiple forecasts throughout the year for the same fiscal period. I want to keep only the latest forecast for each calendar year: if an analyst has multiple predictions in the same year, the earlier observations are removed from the data set and the last one is kept.
eps_lastforecast <- eps_ordered %>%
  arrange(ticker) %>%
  group_by(analyst, year = lubridate::year(ANNDATS)) %>%
  slice_max(order_by = ANNDATS) %>%  # Keep the latest forecast for each year
  ungroup()
We first arrange the data to sort by ticker using the arrange() function, then group the data by analyst and year using the group_by() function from dplyr. The year variable is created using the year() function from the lubridate package.
Next, we use the slice_max() function to keep the latest forecast for each year for each analyst, based on the “ANNDATS” variable. The order_by argument specifies the column to order the data before selecting the maximum value.
Let’s perform a sanity check for the AAPS ticker! Analyst 1395 from broker house 228 made three forecasts for the fiscal year ending in 2006: one on August 18th, another on November 2nd, and the final one on November 30th. The code successfully captures the latest forecasts for the AAPS ticker.
cat("The data dimension for eps_ordered is", dim(eps_ordered))
The data dimension for eps_ordered is 1910 9
Now, the sanity check for the AAPL ticker! Analyst 194536 from broker house 4439 made three forecasts for the fiscal year ending in 2021: one on January 19th, another on May 23rd, and the final one on July 29th. We used the ‘slice_max()’ function in our code to build the ‘eps_lastforecast’ dataset, which keeps only the latest forecast for this analyst.
# A tibble: 1 × 10
ticker cname broker analyst forecast FPEDATS ANNDATS ANNTIMS ACTUAL
<fct> <fct> <fct> <fct> <dbl> <date> <date> <Period> <dbl>
1 AAPL APPLE 4439 194536 5.57 2021-09-30 2021-07-29 5H 59M 0S 5.61
# ℹ 1 more variable: year <dbl>
cat("The data dimension for eps_lastforecast is", dim(eps_lastforecast))
The data dimension for eps_lastforecast is 457 10
All good! Now, we are ready for the next adventure in our coding exercise.
Calculate difference in days: Forecast horizon
To account for the higher uncertainty associated with EPS forecasts that have longer horizons, I create a new variable called ‘horizon’. This variable captures the forecast horizon for each analyst per calendar year by calculating the time difference in days between the latest forecast and the EPS announcement date. The code below gives the exact horizon only if the FPEDATS and ANNDATS variables are in date format. Check your data structure with str(eps_lastforecast) to make sure that is the case.
Note that we have some negative horizon values, indicating that the analyst forecasts were entered into the database after the actual announcement date. Though I do not know the details, I can only speculate that the forecasts may have been received before the official announcements but entered into the system afterwards.
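A minimal sketch of this calculation on a toy data frame, assuming the horizon is FPEDATS minus ANNDATS measured in days (the toy rows and the eps_toy name are made up for the illustration):

```r
library(dplyr)
library(lubridate)

# Toy stand-in with both columns already converted to Date
eps_toy <- data.frame(
  analyst = c(1, 2),
  FPEDATS = ymd(c("2021-09-30", "2021-09-30")),
  ANNDATS = ymd(c("2021-07-29", "2021-10-05"))
)

# Assumption: horizon = days from the forecast date to the period end date
eps_toy <- eps_toy %>%
  mutate(horizon = as.numeric(difftime(FPEDATS, ANNDATS, units = "days")))

print(eps_toy)
```

The second toy row is dated after the period end, so its horizon comes out negative.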
Calculate the forecast error and the lagged forecast error
A forecast can be considered good when it has a lower forecast error, as this indicates that the predicted values are closer to the actual values. Let’s calculate the forecast performance of each analyst by examining the forecast - ACTUAL distance to determine how far they are from the actual earnings per share (EPS). If the distance is negative, then ACTUAL > forecast, and the analyst undershoots the earnings per share value. Likewise, if the distance is positive, then ACTUAL < forecast, and the analyst overshoots the earnings per share value. A good forecaster is someone who overshoots or undershoots the true value less often, and by less, in their predictions.
There is an idiom in an alien language: “Gaxiz wunallar klofnal yijokni jasorub da yorub ynokni, da kilorub yzorub ginestal xutni.” In English: “A skilled star-gazer is one who avoids both over-jumping and under-jumping their destination, while always keeping their eyes on the cosmic trail.” If you think I made this up, just check with Elon Musk (Tweet Elon); he will confirm.
The code below creates a new column, accuracy, by measuring the distance between the predicted and the actual EPS value in our dataset.
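As a sketch, assuming accuracy is simply forecast minus ACTUAL, here is the calculation on two forecast/actual pairs that appear in the output shown earlier (eps_toy is a made-up name):

```r
library(dplyr)

# Two forecast/actual pairs taken from the rows printed earlier in the post
eps_toy <- data.frame(
  analyst  = c(106330, 194536),
  forecast = c(2.815, 5.57),
  ACTUAL   = c(2.9775, 5.61)
)

# Negative values mean the analyst undershot, positive values mean overshot
eps_toy <- eps_toy %>%
  mutate(accuracy = forecast - ACTUAL)

print(eps_toy)
```

Both toy rows come out negative, so both analysts undershot the announced EPS.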
It’s important to recognize that the current forecast error is not a reliable predictor of future forecast accuracy. This is because we can only determine whether a forecast is good or not after the actual value is known, making the current forecast error data irrelevant for prediction purposes. Instead, we must rely on past forecast accuracy as a key factor in assessing the current forecast quality of a forecaster.
The following code generates a new column named Accuracy_LastYear which incorporates the previous year’s forecast error as a predictor for this year’s EPS value.
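A minimal sketch of the lag step on toy data, assuming one row per analyst, ticker, and year. Note that dplyr::lag() takes the previous row within each group, which equals the previous year only when no years are skipped:

```r
library(dplyr)

# Toy accuracy values; two analysts covering the same ticker
eps_toy <- data.frame(
  ticker   = "AAPL",
  analyst  = c(1, 1, 1, 2, 2),
  year     = c(2019, 2020, 2021, 2020, 2021),
  accuracy = c(-0.10, 0.05, -0.02, 0.20, -0.15)
)

# Within each analyst-ticker pair, shift accuracy down by one row
eps_toy <- eps_toy %>%
  arrange(ticker, analyst, year) %>%
  group_by(ticker, analyst) %>%
  mutate(Accuracy_LastYear = dplyr::lag(accuracy)) %>%
  ungroup()

print(eps_toy)
```

The first observation of each analyst has no earlier row to draw from, so its Accuracy_LastYear is NA.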
We just created a lot of missing values. When incorporating lagged forecast accuracy values into our dataset, missing values are inevitable, since the first observation for each analyst has no previous accuracy value to draw from.
In the world of finance, it is generally believed that analysts who have been monitoring a company for a prolonged period are more likely to make accurate predictions. This is due to the fact that they have a deeper understanding of the company’s inner workings and are better equipped to interpret trends and patterns in the data. Now, we will create a new variable called experience that tracks the number of years an analyst has been monitoring a particular company and making predictions. This variable is a cumulative count of the analyst’s years of experience with the company, and will be a key factor in our analysis moving forward. Further analysis will not be provided in this blog post, as this segment will be left for my UNT ADTA students to tackle on their own.
We first created a vector of 1s, called track, in the eps_lastforecast dataset to calculate the cumulative number of years. The same analyst can monitor multiple companies at the same time, so we first arrange our data. In R, the arrange() function reorders the rows of a data frame; here we ordered by ticker in ascending order, the default. After grouping the data with the group_by() function, we used the cumsum() function to calculate the cumulative sum for each analyst. The cumsum() function takes the track values and returns a vector of the same length containing the running total, which here is the number of years.
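The steps above can be sketched on a toy data frame as follows. The grouping by both analyst and ticker is my assumption about how experience is kept separate per company, and eps_toy is a made-up name:

```r
library(dplyr)

# One analyst covering two tickers over several years
eps_toy <- data.frame(
  ticker  = c("AAPL", "AAPL", "AAPL", "AAPS", "AAPS"),
  analyst = c(1, 1, 1, 1, 1),
  year    = c(2019, 2020, 2021, 2020, 2021)
)

eps_toy <- eps_toy %>%
  mutate(track = 1) %>%                    # one unit per analyst-year row
  arrange(ticker, analyst, year) %>%       # ascending order is the default
  group_by(analyst, ticker) %>%
  mutate(experience = cumsum(track)) %>%   # running count of years
  ungroup()

print(eps_toy)
```

The count restarts for each ticker, so the same analyst can have three years of experience with one company and two with another.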
Sanity check!
It is important to verify that your code is performing as intended by conducting a thorough check. The code below prints the rows for the analyst 9947 working at broker 3037. Guess what, this person has the highest experience in our dataset.
Our analysis shows that analyst 9947, who works at broker 3037, had 10 years of experience in 2011, 11 years in 2012, and 12 years in 2013. There is no need to panic, our code is just doing fine. Just know that analysts may switch broker houses over time, so these results don’t necessarily indicate that the same analyst will remain at the same broker house indefinitely.
Analyst 9947 worked at 3 different broker houses and has been monitoring the ticker since 2002.
You can use a similar analysis to calculate the brokerage size. If a brokerage house has many analysts making predictions for the same company, it can be a sign that more resources are allocated to analyzing that company.
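One way to sketch brokerage size, assuming it means the number of distinct analysts per broker, ticker, and year (the toy rows and the names eps_toy and broker_size are made up):

```r
library(dplyr)

# Toy rows: broker 18 has two analysts on AAPS, broker 228 has one
eps_toy <- data.frame(
  ticker  = "AAPS",
  broker  = c(18, 18, 18, 228),
  analyst = c(101, 102, 101, 1395),
  year    = c(2006, 2006, 2006, 2006)
)

# Count distinct analysts for each broker, ticker, and year
broker_size <- eps_toy %>%
  group_by(ticker, broker, year) %>%
  summarise(num_analysts = n_distinct(analyst), .groups = "drop")

print(broker_size)
```

Repeated forecasts by the same analyst do not inflate the count, because n_distinct() counts each analyst once.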
Graphs
Now, we can calculate the consensus forecast by taking the average forecast per year and plotting it against the actual EPS values for each company. This part is left for my UNT ADTA students to tackle on their own. Below is the graph for the AAPS ticker, showing the actual EPS values against the mean forecasts. The mean is the benchmark forecast: it takes the average of the forecasts by all analysts as our best forecast of the EPS for that year. The dygraphs package is used, which is a great tool for interactive time-series plots.