My name is Josh Malina, and I work at American Express. Today we'll be talking about time series analysis with pandas. American Express is a globally integrated payments company - we provide credit cards and other services to enrich people's lives. I work on the big data and machine learning team, where we work on time series forecasting, anomaly detection, information retrieval, and other NLP problems.
So let's talk about time series data and how pandas can help us pull information out of time series data sets. After this talk you'll understand time series decomposition, hypothesis testing, data quality issues, and resampling methods. A time series data set is simply a data set where the events in it are pegged to specific times.
So with that we'll move over to the notebook. This data set is a publicly available data set: car accidents in Seattle, Washington. We're gonna be digging through the data set, and using pandas to hopefully extract true information.
This talk is not about forecasting or prediction or neural nets; it's about looking into a data set you've never seen before, a time series data set, and hopefully answering questions with that data. Some of it will be very basic, especially for people who use pandas a lot or who are very familiar with time series. Some of it might be more interesting to people who haven't used much of the time-series-specific functionality in pandas. I've been working with time series data for about nine months now, so this is sort of an overview of what I've learned in the better part of a year.
So at the top of the notebook we have our imports. One of the things I learned while building this notebook is that you can set a default figure size, which is super useful because you don't have to keep telling pandas or Matplotlib to make your pictures bigger. That's there in this first cell. Here are a few links; when I share the notebook later you can take a look at the CSV file that we're parsing, a link to the page where it lives, and a data description. This is a very interesting data set – it has a lot of really cool columns we're not going to use at all, but just to give you a sense of how interesting this data set is, you can see there's stuff about deer here and other domestic animals that you might have hit with your car. So feel free after the talk to dig in and let me know what you find.
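For reference, that figure-size default can be set once at the top of the notebook. Here's a minimal sketch of that kind of import cell – the exact size is just an illustrative choice, not the one from my notebook:

```python
# Imports used by the rest of the examples; setting rcParams once means every
# plot comes out larger without passing figsize to each call.
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (15, 6)  # width, height in inches (arbitrary choice)
```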
So we'll quickly look at our first pandas function. This is the mother of all pandas functions: read_csv. If you look at the documentation for it you'll be looking at it for a long time – it accepts lots and lots of arguments. Here we're doing a simple function call where we're just passing the path to the data and telling it which columns to parse. If you have a very large data set, you want to use that second argument so that you don't read every single column.
Pandas also provides a pretty handy dandy parse_dates argument, so if you have a date/time string, or something that looks like a date/time string, pandas will automatically try to parse it into a Python datetime object under the hood. index_col is the column you want to index on, and then we sort it with sort_index. When you call this function, the slowest part is going to be parse_dates, and then maybe sort_index after that. But it's important to sort, because then you know you're starting at the beginning of the data set – it sorts in ascending order.
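To make that concrete, here's a minimal sketch of that kind of read_csv call. The file name and column names below are assumptions standing in for the actual Seattle collisions file, not the exact ones from the notebook:

```python
import pandas as pd

df = pd.read_csv(
    "seattle_collisions.csv",             # path to the data (hypothetical file name)
    usecols=["INCDTTM", "SEVERITYCODE"],  # read only the columns you need (assumed names)
    parse_dates=["INCDTTM"],              # let pandas parse the date/time strings
    index_col="INCDTTM",                  # index on the datetime column
)
df = df.sort_index()                      # ascending, so the data starts at the beginning
```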
So you can see a few of the columns that I'm reading in. Since we have a time-indexed data frame, we have a lot of really simple ways to get a high-level view of it. And when you're looking at a data set you've never looked at before, you want to lean on these sorts of methods so you can start to develop a feel for the data. By looking at the first index value and the last index value, we can see that we begin in 2003 and we end in 2019. And because pandas has been able to effectively parse these strings as datetime objects, we can also do datetime arithmetic and see that we have more than 5,000 days of data. Right, so softball simple questions – where does my data begin and end, and how is it distributed?
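In code, those softball questions look something like this (assuming the df from the sketch above):

```python
first, last = df.index[0], df.index[-1]
print(first, last)   # roughly 2003 at one end and 2019 at the other
print(last - first)  # datetime arithmetic: a Timedelta spanning over 5,000 days
print(len(df))       # how many accidents were recorded in total
```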
Pandas provides a wrapper around matplotlib, so looking at the data is really easy with a simple call to the hist function, short for histogram – I'm sure none of you knew the end of that word. And here I'm passing in the index to simply get a count of the values. Now this is a good picture, because it doesn't seem like there are big stretches of time where we're missing data, so conclusions that we reach with this data set might be reasonable. Another thing you might notice here is that I'm setting this equal to an underscore, so I'm discarding whatever the return value of hist is. That's good because if I didn't, the cell would show all of the array values, or a subset of them, just before the graph.
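This isn't the exact call from the notebook, but a sketch of the same idea – histogram the timestamps (here bucketed by year) and throw away the return value with an underscore:

```python
# Count of accidents per year; the underscore discards the returned axes so
# the notebook cell shows only the plot.
_ = pd.Series(df.index.year).hist(bins=df.index.year.nunique())
```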
Let's say we have a question: is Seattle getting safer over time? That's a big question. We're really only talking about traffic and accidents, and furthermore we're really only talking about trend. Right, what this data doesn't show is changes in population, changes in transportation options, changes in transportation volume over time. So it's kind of a one dimensional picture, so the conclusions we reach won't be terribly compelling to a municipal board or a government planning agency, but it might be interesting for us as data scientists or machine learning engineers.
Time series data is believed to be composed of a few different elements. Even a single point, time series experts believe, is made up of four or five components, depending on how you slice it: one is called level, or the average value; one is trend; one is seasonality; another is cyclicality; and the fifth is noise, or error. Time series data sets share noise with all data sets, and probably level too. Trend, seasonality, and cyclicality are more specific to time series data, so that's what we're going to focus on. Trend is what it sounds like: is there some long-term movement in the data? Seasonality and cyclicality are a little more interesting. For seasonality, we know that when figures like unemployment rates are reported, they're always seasonally adjusted first. Seasons and cycles sound similar – they feel similar in our brains and in our mouths – but they're actually kind of different. Seasonality is tightly bound to calendar intervals that you're familiar with: weeks, months, hours, minutes, years. It's something that regularly occurs in the time series that you can count on and doesn't vary a lot. Cyclicality is not something you see as often; it normally refers to patterns that are longer and less predictable, and a lot of times it shows up in things like economic downturns. You might have a cycle that occurs every seven years, every ten years, every three years, whereas with seasonality it'd be every Monday, every Sunday, every fifth hour. So that's the difference between seasonality and cyclicality.
If we were able to decompose the time series into these different parts, we could answer this question about safety by showing that there's a downward trend. Let's start graphically, which would be our first tool before trying to decompose the time series: is Seattle getting safer over time? The peaks that we see in 2006 and 2008 – we never see peaks that high again, but we do see some pretty high volumes in 2016 and 2018. Also, Seattle's population has been growing, so the fact that accidents are going down a little or staying the same really does lend robustness to the answer. Yes, it is getting safer – but let's go through the motions.
There are a few different ways to do seasonal decomposition. If you use a library like statsmodels, the way it detects trend is just by smoothing out the data. It takes a weighted moving average and tries to get rid of some of the noise, so that a data set that looks like this can then later look like this. I would say that the conclusions you draw in the beginning from that histogram are similar to the ones here, but you might argue that this is a little bit cleaner. Still, we should couch our answer in the question – how far have we really come? Because in 2004 the average number of accidents in a day is somewhere between 30 and 60, maybe 30 and 70, and by the end of 2018 we're still in that 30/40/50 range, so maybe it is getting slightly safer.
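As a stand-in for that kind of trend curve, a centered rolling mean over the daily counts gives the same sort of smoothing. This is a sketch, not statsmodels' exact procedure; the 365-day window is an assumption chosen to wash out weekly and yearly seasonality:

```python
daily = df.resample("D").size()                        # accidents per day
trend = daily.rolling(window=365, center=True).mean()  # smoothed trend estimate
_ = trend.plot()
```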
A lot of times when you're using a forecasting or time series library, it will ask you to tell it whether the time series' seasonality is additive or multiplicative. statsmodels asks you this, and the Facebook Prophet library also asks you this. So what does this mean – what's the difference between additive seasonality and multiplicative seasonality? I'll show you a graph and also give you an explanation. This is a picture of additive seasonality. You can see two things here: one is trend – there's an upward trend – and the second is seasonality. These are regularly repeating seasons, but what you don't see is the seasons becoming more exaggerated as a function of time; they stay around the same amplitude. Here, by contrast, they are becoming more exaggerated. If you had a data set about how often dogs smile on a weekday, and in the beginning Mondays are pretty sad because their owners are leaving, but over time they get more and more sad – Mondays become more Monday-ish – that would be multiplicative seasonality. Most data sets you'll find are additive, but you might happen upon a multiplicative one. By telling these libraries that fact, you help them decompose the series. When they combine all these components – trend, seasonality, noise – they're either going to add them together or multiply them. If you can estimate that correctly (or a subject matter expert can give you that kind of insight), it's a good piece of information to hand to the library.
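Handing that choice to statsmodels looks roughly like this. seasonal_decompose is the real statsmodels function; the period of 365 is my assumption for daily data with a yearly season (older statsmodels versions called this argument freq), and the multiplicative option only works when every value is positive:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

daily = df.resample("D").size()
result = seasonal_decompose(daily, model="additive", period=365)  # or model="multiplicative"
_ = result.plot()  # panels for observed, trend, seasonal, and residual
```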
So since I wasn't happy with statsmodels, I'm leaning on another method which I'm excited to talk to you about: resampling, to smooth it even further. Resampling is when you change the interval frequency at which you ask the data to give you values. Most of the time what you do is downsample. Downsampling means less frequent intervals – the frequency goes down. So if we started our data set at the level of the day, we're going to downsample and ask for it at the month, or at the year. Downsampling is really the only thing you can ask pandas to do without asking it to invent data. If your data set began at the minute level and then you upsample and ask for it at the second level, pandas is going to be sitting there scratching its head wondering, where am I going to get this data from?
If you really find yourself in a situation where you do want to upsample, pandas does provide a few options for filling in the null values that it's going to give you. There are three in particular that I'm going to talk about here. One is bfill, which stands for backfill. One is ffill, for forward fill. The other one is interpolate. They are exactly what they sound like. Backfill is going to drag the value that you know from the future and shove it back across all the values you don't know, up until the point that you do know. Forward fill is the opposite of that. So if I know ten dogs are smiling on Monday, and I know that twenty are smiling on Friday, and I backfill Thursday, Wednesday, and Tuesday, those days are all going to get Friday's value. Interpolate is a little more complicated: pandas assumes a linear trend between your known values, fits a line from the value you know to the next value you know, and fills in the values in between. A note of caution for those of you who are interested in forecasting – you could introduce leakage into your training set if you do this.
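Here's a toy sketch of those three fill options, using the dog-smile numbers from the example rather than the accident data:

```python
import pandas as pd

# One known value on a Monday and one on a Friday.
s = pd.Series([10, 20], index=pd.to_datetime(["2019-01-07", "2019-01-11"]))

up = s.resample("D").asfreq()   # upsample to daily: Tuesday through Thursday become NaN
print(up.bfill())               # Tue/Wed/Thu all get Friday's value (20)
print(up.ffill())               # Tue/Wed/Thu all get Monday's value (10)
print(up.interpolate())         # values fall on a straight line between 10 and 20
```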
Our data set started with accidents happening whenever, then I showed them to you at the daily level, and now I'm showing them to you at the six-month level. When you call resample in the pandas library, it asks for a string that describes the frequency you want. In this case six is the number six and M stands for month; earlier I did it at the daily level with D. After you ask for that, you just get a group-by-like aggregate object that you can't really look at yet. Then you have to tell pandas what you want to do with it. You might want to take a sum – in this case I'm adding up all the accidents that occur in a day. You can see from the y-axis there are around 40 or 50 accidents. But I can also ask for the mean – if we're resampling every six months, that means we're taking the mean daily value. Notice the y-axis has not changed; if we had called sum on it, the y-axis would be much bigger. Now this is much more smoothed out. I'll ask again: does it seem like it's getting safer in Seattle? It's not as clear anymore. We have many fewer points here, so there's a lot less information, but we've gotten rid of all the seasonality and other stuff that might be distracting us.
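A sketch of those two resampling calls, again assuming the df from earlier:

```python
daily = df.resample("D").size()            # accidents per day
total_per_6m = daily.resample("6M").sum()  # total accidents in each six-month bucket
mean_per_6m = daily.resample("6M").mean()  # mean daily accidents in each bucket
_ = mean_per_6m.plot()
```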
I'm still unconvinced, so let's take another look at the data. These are box-and-whisker diagrams across all the years. This is nice for seeing overlap. We see the median values and the whisker tails. There's a lot of overlap here, so it could be the case that it's getting safer over time, but the years aren't cleanly separable. It would be hard to build a classifier and ask it whether a given value comes from the second half of the data set or the first half. Ultimately it's going to be a somewhat unsatisfying answer.
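The box-and-whisker view can be produced with pandas' built-in boxplot; a sketch, grouping daily counts by year:

```python
daily = df.resample("D").size().to_frame(name="accidents")
daily["year"] = daily.index.year
_ = daily.boxplot(column="accidents", by="year")  # one box of daily counts per year
```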
So if you're not feeling convinced by visualization, or by resampling, we can do hypothesis testing. The ANOVA test is a way of asking whether portions of your data are generated from different distributions. ANOVA is an acronym that stands for analysis of variance. Even though we're trying to detect differences in means, what ANOVA looks at is the variance across groups compared to the variance within groups. If we detect a lot of variance across the groups, we can only be secure in our belief that it's caused by differences in the actual groups if the variance within those little clusters or buckets is also tight – otherwise we can't reject the null hypothesis and it could be a fluke. In this ANOVA test I'm separating out all the years and running them through the ANOVA function to figure out whether the p-value or the F-statistic it gives me back merits the conclusion we're looking for: whether or not there's a difference.
ANOVA asks us to only use it in circumstances where we abide by a few principles. One is that the samples are independent; another is that they're roughly normally distributed, which the histograms here make reasonably clear; the other is that the variances are similar. ANOVA is confident that there's some difference here, but it hasn't answered our question – is there a downward trend? It's just telling us there are differences across the years, but maybe it gives us a little confidence that we can dig a little deeper.
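The ANOVA step itself can be done with SciPy's f_oneway. That function is real; the grouping of daily counts by year is my reconstruction of what the notebook does, not the notebook's exact code:

```python
from scipy.stats import f_oneway

daily = df.resample("D").size()
groups = [counts.values for _, counts in daily.groupby(daily.index.year)]
f_stat, p_value = f_oneway(*groups)
print(f_stat, p_value)  # a small p-value says the yearly distributions likely differ
```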
I decided to chop the data set in half. Now we have the data from 2003 to 2011, and then from 2011 to 2019. I plotted them and ran ANOVA again. ANOVA is still confident. We're looking at two different histograms; they're hugely overlapping and this is a long period of time. You can see that the second half of the data sits slightly to the left of the first half, so maybe it is slightly safer. But another data scientist or engineer might look at this and say actually there's no difference, because of all that overlap.
Now we're going to transition to something else that I thought was pretty interesting in the data set. I called df.index and looked at the first 25 datetime objects. I was expecting to see dates and times. It turns out that in this data set, the police officer or whoever was recording these accidents really favored the zeroth minute. He might arrive on scene at 12:18 and write down in his paperwork that the accident happened at 12. This is happening a lot – it shows that over 30% of all recorded accidents in Seattle occur during the first 60 seconds of the hour. This tells me that humans made this data set. Also overrepresented are the 30 mark, the 10 mark, the 15 mark – human times. So if you were doing a forecasting problem and your boss asked you, can you predict the next minute, you would probably say no, because this data has these big problems. But if we resample one more time at the daily level, things get better. We even see a nice drop on the 31st, because there are only seven 31sts in a year.
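That minute-of-the-hour check is easy to reproduce; a sketch:

```python
# Share of recorded accidents landing on each minute of the hour.
minute_share = pd.Series(df.index.minute).value_counts(normalize=True)
print(minute_share.head(10))  # minute 0 dominates, with 30, 15, and 10 overrepresented too
```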
A couple of books that I've been reading to help me with this stuff: Practical Time Series Analysis, which you can order on Amazon, and Quantitative Forecasting Methods. Both really good books about forecasting, and sort of this nitty-gritty time series analysis stuff. Thank you very much.