analytics - Dealing with Time Series Data

Time series data is an important type of data that is widely used in many fields, including finance, economics, and environmental science. In this blog, we will explore the importance of dealing with time series data and how R can help us to become better data scientists by providing us with the necessary tools to analyze and comprehend time series data. In this practice, we will work with a subset of analysts’ forecast data of earning per share (EPS) provided by Institutional Brokers’ Estimate System.

Let’s load the packages, call the dataset and glance the data we have by viewing the first and the last rows.

Load the packages and read the csv file in R

library(lubridate)
library(dplyr)
library(ggplot2)
library(knitr)
library(dygraphs)
library(xts)
eps<-read.csv("eps.csv", header=T)

knitr::kable(rbind(head(eps, n=1), tail(eps,n=1)))

	ticker	cname	broker	analyst	forecast	FPEDATS	ANNDATS	ANNTIMS	ACTUAL
1	AAPL	APPLE	3037	106330	2.815	20180930	20171102	23:04:00	2.9775
1910	AAPS	ADVANCE AUTO	98	87084	5.500	20131231	20130503	1:20:00	5.3200

We have data for two companies, Apple and Advanced Auto.

TICKER: A unique identifier assigned to each security. In this blog, we have only two tickers: AAPL for Apple Company and AAPS for Advanced Auto. Company names are stored as cname.
broker: Sell-side institution (mostly broker house).
analyst: The person who makes the forecast and work for sell-side institution. brokers and analysts are represented by codes to hide their real names.
forecast: This is the analyst’s earning per share (EPS) forecast for the company share.
FPEDATS: The Forecast Period End Date: It is the ending date of the fiscal period to which the estimate applies. For the majority of companies, the FPEDATS date is December 31st of that year.
ANNDATS: The Announce date: It is the date on which the analyst first made that particular estimate.
ANNTIMS: The precise announce date
ACTUAL: The actual EPS value announced by the company.

str(eps)

'data.frame':   1910 obs. of  9 variables:
 $ ticker  : chr  "AAPL" "AAPL" "AAPL" "AAPL" ...
 $ cname   : chr  "APPLE" "APPLE" "APPLE" "APPLE" ...
 $ broker  : int  3037 31 3477 1951 2334 1267 171 157 464 228 ...
 $ analyst : int  106330 120641 75007 107846 56181 110208 113333 112843 9584 72069 ...
 $ forecast: num  2.81 2.66 2.95 3 2.71 ...
 $ FPEDATS : int  20180930 20180930 20180930 20180930 20180930 20180930 20180930 20180930 20180930 20180930 ...
 $ ANNDATS : int  20171102 20171102 20171102 20171102 20171102 20171102 20171102 20171102 20171103 20171103 ...
 $ ANNTIMS : chr  "23:04:00" "19:21:00" "22:55:00" "0:00:00" ...
 $ ACTUAL  : num  2.98 2.98 2.98 2.98 2.98 ...

In order to better understand the data, a good starting point is to print the first row of the dataset and interpret what information it contains.

knitr::kable(head(eps, n=1))

ticker	cname	broker	analyst	forecast	FPEDATS	ANNDATS	ANNTIMS	ACTUAL
AAPL	APPLE	3037	106330	2.815	20180930	20171102	23:04:00	2.9775

On 2017‐Nov-2 (ANNDATS), analyst 106330 (analyst) working at broker house 3037 (broker) predicts that the EPS for Apple Computer (cname) with a ticker of AAPL (ticker ) with forecast period ending 30‐Sep-2017 (FPEDATS) is\$2.815 (forecast). This estimate was entered into the database at 23:04 (ANNTIMS), and APPLE announced an actual EPS of $2.9775 (ACTUAL).

Check for the missing values

One good news is that we have no missing data in our dataset, which means that we can proceed with our analysis without worrying about imputing missing values.

sapply(eps, function(x) sum(is.na(x)))

  ticker    cname   broker  analyst forecast  FPEDATS  ANNDATS  ANNTIMS 
       0        0        0        0        0        0        0        0 
  ACTUAL 
       0

Order the data and declare some variables as date variable

# Order the eps dataset by "broker" and "analyst" in descending order
eps_ordered <- eps[order(-eps$broker, -eps$analyst), ]

# display the first two rows
print (head(eps_ordered,n=2))

    ticker cname broker analyst forecast  FPEDATS  ANNDATS  ANNTIMS ACTUAL
822   AAPL APPLE   4439  194536     4.00 20210930 20210119  5:32:00   5.61
912   AAPL APPLE   4439  194536     5.16 20210930 20210523 22:44:00   5.61

# Print the difference between the first and second rows in ANNDATS
print (eps_ordered$ANNDATS[2] - eps_ordered$ANNDATS[1])

[1] 404

The first row in eps_ordered has the “1/19/2021” ANNDATS entry, and the second row has “5/23/2021”. Therefore, there are just 124 days between the first and second rows. However, if we try to take the difference in R, we will get 404, which is incorrect. This is because the ANNDATS and FPEDATS variables are recorded as integers, not date variables. Additionally, the ANNTIMS variable is currently coded as a character variable, but it should also be coded as a date entry.

To convert the integer variables FPEDATS and ANNDATS and the character variable ANNTIMS to date variables in R, we will use the ymd() and the ymd_hms() functions in lubridate package. The ymd() function converts variables by specifying the year, month, and day format of the dates. The ymd_hms() function converts the character variable ANNTIMS to a date variable by specifying the year, month, day, hour, minute, and second format of the dates.

eps_ordered <- eps_ordered %>%
  mutate(FPEDATS    = lubridate::ymd(FPEDATS))%>%
  mutate(ANNDATS     = lubridate::ymd(ANNDATS ))%>%
  mutate(ANNTIMS      = lubridate::hms(ANNTIMS  ))



# Print the updated dataframe
print (eps_ordered$ANNDATS[2] - eps_ordered$ANNDATS[1])

Time difference of 124 days

Declare some variables as factor

Next, we declare ‘ticker’, ‘cname’, ‘broker’, and ‘analyst’ variables in ‘eps_ordered’ dataset as factor.

list<-c('ticker', 'cname', 'broker','analyst')
eps_ordered[list]<-lapply(eps_ordered[list], factor)

str(eps_ordered)

'data.frame':   1910 obs. of  9 variables:
 $ ticker  : Factor w/ 2 levels "AAPL","AAPS": 1 1 1 1 1 1 1 1 1 1 ...
 $ cname   : Factor w/ 2 levels "ADVANCE AUTO",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ broker  : Factor w/ 89 levels "4","11","16",..: 89 89 89 88 87 87 86 86 86 86 ...
 $ analyst : Factor w/ 140 levels "1395","1657",..: 137 137 137 140 22 22 123 123 123 123 ...
 $ forecast: num  4 5.16 5.57 5.91 2.87 ...
 $ FPEDATS : Date, format: "2021-09-30" "2021-09-30" ...
 $ ANNDATS : Date, format: "2021-01-19" "2021-05-23" ...
 $ ANNTIMS :Formal class 'Period' [package "lubridate"] with 6 slots
  .. ..@ .Data : num  0 0 0 0 0 0 0 0 0 0 ...
  .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
  .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
  .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
  .. ..@ hour  : num  5 22 5 12 11 0 13 15 14 17 ...
  .. ..@ minute: num  32 44 59 39 33 0 32 2 50 42 ...
 $ ACTUAL  : num  5.61 5.61 5.61 5.61 2.97 ...

Explore your data

What if I want to calculate the total number of unique broker houses that provide forecast in a specific year, say 2020? Here’s an example R code that can be used to calculate the total number of “broker” when the “FPEDATS” variable is in the year 2020:

# Filter eps_ordered dataset to include only rows where FPEDATS is in year 2020
eps_2020 <- eps_ordered[format(eps_ordered$FPEDATS, "%Y") == "2020", ]

# Count number of broker houses in eps_2020 dataset
num_broker <- length(unique(eps_2020$broker))
cat("The total number of brokers in 2020 is", num_broker)

The total number of brokers in 2020 is 41

Above, we first filter the eps_ordered dataset to include only rows where the year in the “FPEDATS” variable is 2020. We do this by converting the “FPEDATS” variable to a character format using the format() function and extracting the year using the %Y format specifier. Also, note that the unique() function in R is used to extract the unique values from a vector or data frame column.

Now, let’s identify the total number of analysts that provide forecasts for earning per share (EPS) for the Apple company in 2018 and store it as analyst_APPL_2018.

It is possible that an analyst can provide multiple EPS during the same calendar year. Hence, we need to calculate the distinct number of analyst.

# Filter eps_ordered dataset to include only rows where ticker is AAPL

  analyst_APPL_2018 <- eps_ordered %>%
  mutate(year = year(FPEDATS))%>%
  filter(ticker == 'AAPL' & year==2018)%>%
  select(analyst)%>%
  n_distinct()
  
cat("The total number of analysts that provide EPS forecasts for Apple in 2018 is", analyst_APPL_2018)

The total number of analysts that provide EPS forecasts for Apple in 2018 is 49

Above, we extracted the year from FPEDATS and filter to keep rows for APPL ticker and year of 2018, then calculate the distinct number of analysts. The n_distinct() function from the dplyr package counts the number of unique analysts in the resulting data frame. If we wanted to calculate the total number of broker houses that provide forecasts for earning per share (EPS) for the Apple company in 2018, we just needed replace analyst with broker in the code provided before.

This time, I want to identify the broker house (broker) with the highest number of analysts that provide forecasts for Advanced Auto (AAPS).

# Group the data by broker and count the number of unique analysts for each broker
  analyst_by_broker <- eps_ordered %>%
  mutate(year = year(FPEDATS))%>%
  filter(ticker == 'AAPS')%>%
  group_by(broker) %>% 
  summarise(num_analysts = n_distinct(analyst))
  
  

# Find the broker with the highest number of unique analysts
top_broker <- analyst_by_broker %>% filter(num_analysts == max(num_analysts))

cat("The broker house with the highest number of analysts that provide forecasts for AAPS is", top_broker$broker)

The broker house with the highest number of analysts that provide forecasts for AAPS is 18

Now, I want to to identify which broker house (broker) has the largest number of analysts providing forecasts for Apple (AAAP) in the fiscal year ending in 2021.

# Group the data by year and count the number of unique analysts for each broker in year 2021
  APPLEanalyst_by_broker <- eps_ordered %>%
  mutate(year = year(FPEDATS))%>%
  filter(ticker == 'AAPL'& year==2021)%>%
  group_by(broker) %>% 
  summarise(num_analysts = n_distinct(analyst))
  
  

# Find the broker(a) with the highest number of unique analysts
top_broker2021 <- APPLEanalyst_by_broker %>% filter(num_analysts == max(num_analysts))

cat("The broker house(s) with the highest number of analysts that provide forecasts for AAPL in 2021  is", top_broker2021$broker)

The broker house(s) with the highest number of analysts that provide forecasts for AAPL in 2021  is 10 13 39

Drop observations: Latest forecast

It is quite possible that an analyst makes multiple forecasts throughout the year for the same fiscal period. I just want to keep the latest forecast for each calendar year by removing the earlier observations from the data set if an analyst has multiple predictions for the same year and keep the last one.

eps_lastforecast<-eps_ordered%>%
  arrange(ticker)%>%
  group_by(analyst, year=lubridate::year(ANNDATS))%>%
   slice_max(order_by = ANNDATS) %>%  # Keep the latest forecast for each year
  ungroup()

We first arrange the data to sort by ticker using the arrange() function, then group the data by analyst and year using the group_by() function from dplyr. The year variable is created using the year() function from the lubridate package.

Next, we use the slice_max() function to keep the latest forecast for each year for each analyst, based on the “ANNDATS” variable. The order_by argument specifies the column to order the data before selecting the maximum value.

Let’s perform a sanity check for the AAPS ticker! Analyst 1395 from broker house 228 made three forecasts for the fiscal year ending in 2006: one on August 18th, another on November 2nd, and the final one on November 30th. The code successfully captures the latest forecasts for the AAPS ticker.

eps_ordered%>%
  filter(broker==228 & analyst==1395& year(FPEDATS)==2006)

  ticker        cname broker analyst forecast    FPEDATS    ANNDATS   ANNTIMS
1   AAPS ADVANCE AUTO    228    1395     2.14 2006-12-31 2006-08-18 9H 39M 0S
2   AAPS ADVANCE AUTO    228    1395     2.17 2006-12-31 2006-11-02 15H 0M 0S
3   AAPS ADVANCE AUTO    228    1395     2.15 2006-12-31 2006-11-30 4H 22M 0S
  ACTUAL
1   2.16
2   2.16
3   2.16

eps_lastforecast%>%
  filter(broker==228 & analyst==1395& year(FPEDATS)==2006)

# A tibble: 1 × 10
  ticker cname    broker analyst forecast FPEDATS    ANNDATS    ANNTIMS   ACTUAL
  <fct>  <fct>    <fct>  <fct>      <dbl> <date>     <date>     <Period>   <dbl>
1 AAPS   ADVANCE… 228    1395        2.15 2006-12-31 2006-11-30 4H 22M 0S   2.16
# ℹ 1 more variable: year <dbl>

cat("The data dimension for eps_ordered is", dim(eps_ordered))

The data dimension for eps_ordered is 1910 9

Now, the sanity check for AAPL ticker! Analyst 194536 from broker house 4439 made three forecasts for the fiscal year ending in 2021: one on Januart 19th, another on April 23rd, and the final one on July 29th. We used the ‘slice_max()’ function in our code to filter the ‘eps_lastforecast’ dataset and keep only the latest forecast for this analyst.

eps_ordered%>%
  filter(broker==4439 & analyst==194536& year(FPEDATS)==2021)

  ticker cname broker analyst forecast    FPEDATS    ANNDATS    ANNTIMS ACTUAL
1   AAPL APPLE   4439  194536     4.00 2021-09-30 2021-01-19  5H 32M 0S   5.61
2   AAPL APPLE   4439  194536     5.16 2021-09-30 2021-05-23 22H 44M 0S   5.61
3   AAPL APPLE   4439  194536     5.57 2021-09-30 2021-07-29  5H 59M 0S   5.61

eps_lastforecast%>%
  filter(broker==4439 & analyst==194536& year(FPEDATS)==2021)

# A tibble: 1 × 10
  ticker cname broker analyst forecast FPEDATS    ANNDATS    ANNTIMS   ACTUAL
  <fct>  <fct> <fct>  <fct>      <dbl> <date>     <date>     <Period>   <dbl>
1 AAPL   APPLE 4439   194536      5.57 2021-09-30 2021-07-29 5H 59M 0S   5.61
# ℹ 1 more variable: year <dbl>

cat("The data dimension for eps_lastforecast is", dim(eps_lastforecast))

The data dimension for eps_lastforecast is 457 10

All good! Now, we are ready for the next adventure in our coding exercise.

Calculate difference in days: Forecast horizon

To account for the higher uncertainty associated with EPS forecasts that have longer horizons, I create a new variable called ‘horizon’. This variable will capture the forecast horizon for each analyst per calendar year by calculating the time difference in days between the latest forecast and the EPS announcement date. The code below should give you the exact horizon only if FPEDATS and ANNDATS variables are in date format. Check your data structure with the str(eps_lastforecast) code to make sure it it the case.

eps_lastforecast<- eps_lastforecast %>% 
  mutate(horizon=FPEDATS    -ANNDATS)

head(eps_lastforecast, n=4)

# A tibble: 4 × 11
  ticker cname   broker analyst forecast FPEDATS    ANNDATS    ANNTIMS    ACTUAL
  <fct>  <fct>   <fct>  <fct>      <dbl> <date>     <date>     <Period>    <dbl>
1 AAPS   ADVANC… 228    1395        2.15 2006-12-31 2006-11-30 4H 22M 0S    2.16
2 AAPS   ADVANC… 228    1395        2.34 2007-12-31 2007-11-01 15H 47M 0S   2.31
3 AAPS   ADVANC… 228    1395        2.63 2008-12-31 2008-10-30 12H 27M 0S   2.75
4 AAPS   ADVANC… 228    1395        3.06 2009-12-31 2009-11-12 8H 39M 0S    3   
# ℹ 2 more variables: year <dbl>, horizon <drtn>

Note that we have some negative horizon values indicating that the analyst forecasts entered into the database after the actual announcement date. Though I do not know the details, I can just speculate that MAYBE they have received the forecasts before the official announcements but the entry to the system was after the the announcement.

Calculate the forecast error and the lagged forecast error

A forecast can be considered good when it has a lower forecast error, as this indicates that the predicted values are closer to the actual values. Let’s calculate the forecast performance of each analyst by examining the forecast-ACTUAL distance to determine how far they are from the actual earning per share (EPS). If the distance is negative, then ACTUAL>forecast, the analyst undershoots the earning per share value. Likewise, if the distance is positive, then ACTUAL< forecast, the analyst overshoots the earning per share value. A good forecaster is someone who makes fewer instances of overshooting or undershooting the true value in their predictions.

There is an idiom in an alien language: “Gaxiz wunallar klofnal yijokni jasorub da yorub ynokni, da kilorub yzorub ginestal xutni. In English”A skilled star-gazer is one who avoids both over-jumping and under-jumping their destination, while always keeping their eyes on the cosmic trail.” You think I make this up, just check it with Elon Musk (Tweet Elon), he will confirm.
The code below creates a new column accuracyby measuring the distance between the predicted and the actual EPS value in our dataset.

eps_lastforecast<- eps_lastforecast %>%mutate(accuracy =forecast-ACTUAL)

It’s important to recognize that the current forecast error is not a reliable predictor of future forecast accuracy. This is because we can only determine whether a forecast is good or not after the actual value is known, making the current forecast error data irrelevant for prediction purposes. Instead, we must rely on past forecast accuracy as a key factor in assessing the current forecast quality of a forecaster.

The following code generates a new column named Accuracy_LastYear which incorporates the previous year’s forecast error as a predictor for this year’s EPS value.

eps_lastforecast<- eps_lastforecast %>% arrange(ticker, analyst,FPEDATS) %>% group_by(analyst) %>% mutate(Accuracy_LastYear=lag(forecast)-lag(ACTUAL))

We just created a low of missing values. When incorporating lagged forecast accuracy values into our dataset, missing values are inevitable since the first few rows of data will have no previous accuracy values to draw from.

sapply(eps_lastforecast, function(x) sum(is.na(x)))

           ticker             cname            broker           analyst 
                0                 0                 0                 0 
         forecast           FPEDATS           ANNDATS           ANNTIMS 
                0                 0                 0                 0 
           ACTUAL              year           horizon          accuracy 
                0                 0                 0                 0 
Accuracy_LastYear 
              140

Calculate the experience

In the world of finance, it is generally believed that analysts who have been monitoring a company for a prolonged period are more likely to make accurate predictions. This is due to the fact that they have a deeper understanding of the company’s inner workings and are better equipped to interpret trends and patterns in the data. Now, we will create a new variable called experience that tracks the number of years an analyst has been monitoring a particular company and making predictions. This variable is a cumulative count of the analyst’s years of experience with the company, and will be a key factor in our analysis moving forward. Further analysis will not be provided in this blog post, as this segment will be left for my UNT ADTA students to tackle on their own.

eps_lastforecast$track<-1


eps_lastforecast<- eps_lastforecast %>%
  arrange(ticker)%>%
  group_by(analyst) %>% 
  mutate("experience"=cumsum(track))

We first created a vector of 1s in eps_lastforecast dataset to calculate the cumulative number of years. It is possible that the same analyst can monitor multiple companies at the same time. So we first arrange our data. In R, the arrange() function is used to reorder the rows of a data frame, here we ordered it for ticker in descending order, the default selection. By grouping data with the group_by function, we used cumsum function to calculate the cumulative sum for each analyst. Here, the cumsum() function takes the track value and returns a vector or matrix of the same size containing the cumulative sum of the input, in here the number of years.

Sanity check!

It is important to verify that your code is performing as intended by conducting a thorough check. The code below prints the rows for the analyst 9947 working at broker 3037. Guess what, this person has the highest experience in our dataset.

eps_lastforecast%>%
  filter(broker==3037 & analyst==9947)

# A tibble: 3 × 15
# Groups:   analyst [1]
  ticker cname   broker analyst forecast FPEDATS    ANNDATS    ANNTIMS    ACTUAL
  <fct>  <fct>   <fct>  <fct>      <dbl> <date>     <date>     <Period>    <dbl>
1 AAPS   ADVANC… 3037   9947        5    2011-12-31 2011-11-09 18H 8M 0S    5.11
2 AAPS   ADVANC… 3037   9947        5.05 2012-12-31 2012-10-22 10H 6M 0S    5.22
3 AAPS   ADVANC… 3037   9947        5.45 2013-12-31 2013-04-10 13H 29M 0S   5.32
# ℹ 6 more variables: year <dbl>, horizon <drtn>, accuracy <dbl>,
#   Accuracy_LastYear <dbl>, track <dbl>, experience <dbl>

Our analysis shows that analyst 9947, who works at broker 3037, had 10 years of experience in 2011, 11 years in 2012, and 12 years in 2013. There is no need to panic, our code is just doing fine. Just know that analysts may switch broker houses over time, so these results don’t necessarily indicate that the same analyst will remain at the same broker house indefinitely.

Here is a better picture.

eps_lastforecast%>%
  filter(analyst==9947)

# A tibble: 12 × 15
# Groups:   analyst [1]
   ticker cname  broker analyst forecast FPEDATS    ANNDATS    ANNTIMS    ACTUAL
   <fct>  <fct>  <fct>  <fct>      <dbl> <date>     <date>     <Period>    <dbl>
 1 AAPS   ADVAN… 308    9947       0.887 2002-12-31 2002-11-19 11H 38M 0S  0.893
 2 AAPS   ADVAN… 308    9947       1.42  2003-12-31 2003-12-09 1H 20M 0S   1.51 
 3 AAPS   ADVAN… 308    9947       1.67  2004-12-31 2004-12-01 1H 9M 0S    1.68 
 4 AAPS   ADVAN… 308    9947       2.13  2005-12-31 2005-08-11 1H 27M 0S   2.13 
 5 AAPS   ADVAN… 308    9947       2.16  2006-12-31 2006-11-30 1H 40M 0S   2.16 
 6 AAPS   ADVAN… 308    9947       2.4   2007-12-31 2007-02-15 12H 45M 0S  2.31 
 7 AAPS   ADVAN… 183    9947       2.57  2008-12-31 2008-12-17 14H 47M 0S  2.75 
 8 AAPS   ADVAN… 183    9947       3.05  2009-12-31 2009-11-12 13H 34M 0S  3    
 9 AAPS   ADVAN… 183    9947       4.03  2010-12-31 2010-11-10 20H 35M 0S  3.95 
10 AAPS   ADVAN… 3037   9947       5     2011-12-31 2011-11-09 18H 8M 0S   5.11 
11 AAPS   ADVAN… 3037   9947       5.05  2012-12-31 2012-10-22 10H 6M 0S   5.22 
12 AAPS   ADVAN… 3037   9947       5.45  2013-12-31 2013-04-10 13H 29M 0S  5.32 
# ℹ 6 more variables: year <dbl>, horizon <drtn>, accuracy <dbl>,
#   Accuracy_LastYear <dbl>, track <dbl>, experience <dbl>

analyst 9947 worked at 3 different broker houses and is monitoring the ticker since 2002.

You can use a similar analysis to calculate the brokerage size. If a brokerage house have many analysts making predictions for the same company, it can be a sign of more resources allocated for company analysis.

Graphs

Now, we can calculate the consensus forecast by taking the average forecasts per year and plot against the actual EPS values for each company. This part is left for my UNT ADTA students to tackle on their own. Below is the graph of AAPS ticker with the actual EPS values against the mean forecasts. It is the benchmark forecasts which takes the average forecasts by all analysts as our best forecast of the EPS for that year. The dygraphs package is used which is a create tool to have interactive time-series plot.