**“A lot of people in practice talk about stationarity, but they define it very, very loosely. They actually talk about stationarity as if the time series don't change over time.” - Jeffrey Yau**

Thank you for coming. I head the data science team at Alliance Bernstein and previously led different data science and quant teams. I got my PhD in economics from Penn and studied math when I was an undergrad.

Whenever I give a talk, or when people look at my LinkedIn profile, they may think that the only thing I do is economics and finance. But actually, my biggest interest these days is development. I have a background doing a lot of work for developing countries and I still very much enjoy doing that. I am very much involved in different data science communities and am still teaching at UC Berkeley within their Master's in Data Science program online.

This talk is really for data scientists and practitioners who want to do time series forecasting. Hopefully, after today, you will understand the characteristics of a time series forecasting problem, some basic intuition for a particular type of autoregressive model and for neural network models, along with the overall structure of a Python implementation. The two types of models that I'm going to show today are two extremely influential models in statistics and machine learning. First, vector autoregressive models have been extremely popular in economics since the 80s and are well established in the economics literature. Second, the long short-term memory (LSTM) recurrent neural network appears frequently in industry, and many of you have probably heard about its numerous applications powering many of our devices today.

The number one thing I want to do is set up a time series forecasting problem so that we are all on the same page. Then, I want to introduce the ARIMA model before I talk about the vector autoregressive model. After that, I will talk about particular types of recurrent neural networks and for each type, I will formulate a model using partly intuition and also some code and actual Python implementations. I'll conclude by comparing all these models.

When it comes to time series forecasting problems, the main goal is to predict the future using the current information set. That set can include past values of the same series as well as what we call exogenous series. Let me formulate the concept a little bit. We call the forecast Y-hat with the subscript T plus h: if right now is time T, we want to forecast the value h periods from today. Of course, you cannot foresee the future, so you start with a model — it could be a statistical model, or a machine learning model, or simply just rules — and the information set. In time series, it could be as simple as forecasting for tomorrow.

You don’t have to use all the information that has been given to you. Here, for example, you are using only one data point in the series: the most recent observation. That is not a very efficient use of the other information you have. This type of forecast, which is in fact used in practice, is called the persistence forecast.
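As a minimal sketch (the function name here is mine, not from the talk), the persistence forecast just repeats the last observed value for every future period:

```python
# Persistence (naive) forecast: the prediction for every future horizon
# is simply the most recent observed value of the series.
def persistence_forecast(series, horizon=1):
    """Forecast `horizon` future values using only the last observation."""
    last_value = series[-1]
    return [last_value] * horizon

y = [12.0, 13.5, 13.1, 14.2]
print(persistence_forecast(y, horizon=3))  # [14.2, 14.2, 14.2]
```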

Well of course, we can do better than using just one data point in a forecast. Some of you may have used the rolling average. I call it “rolling” because I want to distinguish it from the moving average model that we are going to talk about when we look at the ARIMA model. The rolling average involves just taking an equally weighted average over the last “k” time periods, and that is your forecast for the next period. A slightly better variation is, instead of an equally weighted average, a weighted moving average with the weights declining exponentially over time. Of course, these are just very simple models that we can construct, and they require very little programming. The point is that forecasting does not necessarily have to be extremely difficult, although there are many models that can capture very complicated dynamics in a time series.
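Both simple baselines are one-liners in pandas; the series below is made-up data just for illustration:

```python
import pandas as pd

# Hypothetical values, just for illustration.
y = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0, 13.5])

# Equally weighted rolling average over the last k=3 periods:
# the final value of the rolling mean is the forecast for the next period.
rolling_forecast = y.rolling(window=3).mean().iloc[-1]

# Exponentially weighted variant: recent observations get larger weights.
ewm_forecast = y.ewm(alpha=0.5).mean().iloc[-1]

print(rolling_forecast)  # 13.5
print(ewm_forecast)
```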

In terms of the ARIMA model, the focus is the statistical relationship of just one time series; this is also called a univariate statistical time series model. Again, we use the historical information from the particular series we are studying to predict its observation at the future time “T” plus 1. It can also depend on exogenous series. ARIMA stands for “autoregressive integrated moving average.” ARIMA models the series as a linear function of the mean of the series and its lagged values, for example Y at T minus 1, Y at T minus 2, all the way to the p-th lagged value of the series. You also have the shock terms: unlike a linear regression, where you only have one shock term, here you have the lagged shock terms as well. I have a three-hour tutorial from PyData San Francisco 2016. It’s actually available on YouTube, I think, so if you want to learn more about the ARIMA model you can look at that tutorial.
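To make the autoregressive part concrete, here is a toy special case (an AR(1) with no moving average or integration terms, using my own made-up coefficient): Y_t = phi · Y_{t−1} + shock. We can simulate such a series and recover phi by ordinary least squares on the lagged values:

```python
import numpy as np

# Simulate an AR(1) series: each value is 0.7 times the previous value plus noise.
rng = np.random.default_rng(0)
phi_true = 0.7
y = np.zeros(500)
for t in range(1, 500):
    y[t] = phi_true * y[t - 1] + rng.normal()

# Regress Y_t on Y_{t-1} (with an intercept) to estimate phi.
X = np.column_stack([np.ones(499), y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(coef[1])  # estimated phi, close to the true value of 0.7
```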

*See talks like this in person at our next Data Science Salon: Applying AI and Machine Learning to Finance, Healthcare and Technology, in Miami, Florida.*

So we just talked about univariate time series analysis. Next, we will focus on multivariate time series analysis. What does that mean? First of all, in a multivariate time series model, we model the dynamic properties of each series, meaning the evolution of a particular time series over time. However, we also want to model the interrelationships across the series. We still have to define what “F” is: “F” is just a general function that, in most cases, outlines the system of equations. In the context of time series analysis, stationarity means the mean and the variance don't change over time and the variance-covariance matrix is only a function of the lag.

A lot of people in practice talk about stationarity, but they define it very, very loosely. They actually talk about stationarity as if the time series don't change over time. Stationary series are easier to forecast, but this model can also be applied to non-stationary series via some very simple transformations. If the series can be made stationary using finite differencing, then you can model the differenced series. If, however, the non-stationary time series are cointegrated, meaning a linear combination of them becomes a stationary time series, then you actually have to apply the vector error correction models I mentioned earlier.

When building a VAR model, keep in mind that in the simplest version each equation depends only on the first lag of its own series, and the second equation is the same thing for the second series. However, that is just an uncoupled system: each equation does not affect the other equations. In a true VAR, you actually need to have the lags of the other series on the right-hand side of each equation as well. Writing out all of these lags in all of the equations is not how it generally shows up when you read a textbook or paper. Those sources deal with the matrix formulation instead, because otherwise you have too many equations and variables to keep track of.
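A sketch of that matrix formulation for a two-variable VAR(1), y_t = A · y_{t−1} + e_t, with coefficients I made up for illustration; the off-diagonal entries of A are what couple the two series, and setting them to zero would give the uncoupled system described above:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.5, 0.2],   # series 1 depends on its own lag AND series 2's lag
              [0.1, 0.4]])  # series 2 depends on series 1's lag AND its own lag

# Simulate 200 periods of the coupled system: y_t = A @ y_{t-1} + noise.
y = np.zeros((200, 2))
for t in range(1, 200):
    y[t] = A @ y[t - 1] + rng.normal(size=2)

print(y.shape)  # (200, 2): one column per series
```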

I want to talk now about the general steps needed to build such a model. It doesn’t necessarily need to be in Python, but that is what I choose to illustrate with in my class at Berkeley. I am going to focus primarily on exploratory data analysis, which I think is the most needed step whenever you analyze data. Exploratory data analysis involves looking at some graphs, so you plot your time series. On top of that, you want to examine the correlations between and within the series. The correlation of a series with itself is called autocorrelation, and it is nothing more than the correlation of the series with its own lags. I always encourage my students to really understand exactly what they are plotting, in terms of the formulas behind each and every single one of these objects. Of course, you sometimes also plot the histogram, though it will not determine the order of the model. You do not formally need the histogram, but in practicing data science you will notice that many of your colleagues may only look at the time series using a density graph. The histogram tells you absolutely nothing about the time dependency of the series itself.

In Python, it's actually fairly easy to write down a few functions to plot these. I typically join them into one function so that I can plot them repeatedly, because if the series is not stationary I need to transform it into a stationary series, and after the transformation you have to do the EDA process again, meaning you have to keep plotting these graphs again and again. It will be better for your workflow if you write the function once.
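A minimal sketch of such a reusable EDA function (the names and layout are mine; in practice you might use statsmodels' `plot_acf` rather than computing the autocorrelation by hand):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

def acf(series, nlags):
    """Sample autocorrelation: correlation of the series with its own lags."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    denom = np.dot(x, x)
    return np.array([1.0] + [np.dot(x[k:], x[:-k]) / denom for k in range(1, nlags + 1)])

def eda_plots(series, nlags=20):
    """One reusable function: time plot, autocorrelation, and histogram."""
    fig, axes = plt.subplots(3, 1, figsize=(8, 9))
    axes[0].plot(series); axes[0].set_title("Time series")
    axes[1].stem(acf(series, nlags)); axes[1].set_title("Autocorrelation")
    axes[2].hist(series, bins=30); axes[2].set_title("Histogram (no time info!)")
    fig.tight_layout()
    return fig

rng = np.random.default_rng(2)
fig = eda_plots(np.cumsum(rng.normal(size=300)))  # a random walk, for illustration
print(len(fig.axes))  # 3
```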

The transformations that I'm going to make here are to take the log and then the difference of the series. For a lot of macroeconomic and financial time series, and business series in general, typical transformations include taking the log. When the change between time steps is small, the log difference is approximately the percentage change. It’s also important to difference out the seasonality, and when you do these procedures you should check that you did them correctly. You always want to keep track of the number of observations, because each differencing step consumes observations at the start of the series, and you don’t want to lose observations unintentionally in the process.
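A quick sketch of the log-difference transformation, and of how to reverse it later to get back to the original units (the numbers are made up):

```python
import numpy as np

y = np.array([100.0, 103.0, 101.0, 106.0, 110.0])

# Log difference: approximately the period-over-period percentage change.
# Note that differencing consumes one observation at the start of the series.
log_diff = np.diff(np.log(y))
print(len(y), len(log_diff))  # 5 4

# Reverse the transformation: cumulate the log differences, exponentiate,
# and anchor on the first original observation.
reconstructed = y[0] * np.exp(np.cumsum(log_diff))
print(np.allclose(reconstructed, y[1:]))  # True
```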

When you want to model numerous cross-correlations, this is where Python and statsmodels come in. If you want to implement such a model, you can pull up VARMAX from the statsmodels tsa library, which stands for vector autoregressive moving average (with exogenous regressors). The code itself is not difficult, but actually understanding what you’re implementing is very important. The output of the code should have all the statistics that you need, such as the log-likelihood value. Most useful are the information criteria, which we can use for model selection. We also get the coefficients and their associated statistics. You can do model selection by looping through multiple orders, but you’re still not done yet. You can’t begin to forecast in original units using the model that you currently have, because that model is only appropriate for the transformed series. Before forecasting, you have to reverse the transformation in order to get back to your original units. You can do this using a few lines of Python code. When I work with data scientists or with my students, they will sometimes take the model they develop as law, and I ask them: what does it mean in the context of the whole experiment? If you want to forecast effectively in the presence of unknowns, you always have to relate your numbers back to the context.

I want to use my remaining time to talk about neural networks. Back in the early 90s, a lot of forecasting was done using feed-forward networks. The limitation of that architecture is that it does not account for time ordering: if you feed in each input independently, there is literally no device to keep the past information intact. Recurrent neural networks, some of which were developed in the late 80s, can retain past information by tracking the state of the model and updating that state. A recurrent network computes its hidden state as a function of the previous hidden state as well as the new input. To reason about this model, you unroll the hidden unit over time, because the hidden unit is the layer that is recurrent. In the LSTM, sigmoid functions are used as gates to control the information flow. If a gate is zero, none of the information goes through; if it is one, all of the information goes through. By training the model to its optimal state, we can forecast the series with increased accuracy.
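To make the gating idea concrete, here is a toy single LSTM cell step in NumPy (my own sketch, with random untrained weights; a real forecasting model would use a trained implementation such as Keras' `LSTM` layer). The sigmoids squash values into (0, 1), so they act as soft valves on the information flow:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step: gates decide what to forget, store, and output."""
    z = W @ x + U @ h_prev + b      # all four gates' pre-activations, stacked
    n = h_prev.size
    f = sigmoid(z[0:n])             # forget gate: 0 = drop the old memory
    i = sigmoid(z[n:2*n])           # input gate: 1 = accept the new candidate
    o = sigmoid(z[2*n:3*n])         # output gate
    g = np.tanh(z[3*n:4*n])         # candidate memory content
    c = f * c_prev + i * g          # updated cell state (the "memory")
    h = o * np.tanh(c)              # updated hidden state
    return h, c

rng = np.random.default_rng(4)
n_in, n_hid = 3, 5
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
print(h.shape, c.shape)  # (5,) (5,)
```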

* * *

*For more content like this, don’t miss the next Data Science Salon in Miami, on September 10-11, 2019.*