Staying informed in our rapidly changing world is important, but breaking news is not the only content out there. Anna Coenen, Senior Data Scientist at the New York Times, is determined to make sure that evergreen content reaches the audience that will value it the most. She’s using algorithms to do it, and in her talk at DSS NYC in June 2019, she explained how.
Hi everybody, and thank you for inviting me to talk at this conference. My name is Anna Coenen, I'm a Senior Data Scientist at the New York Times, and I work on algorithmic recommendations. We're currently working on some new products at the Times, but I’m mainly going to talk about work we did last year that is still very relevant. Everything revolves around the use of contextual bandits, which we are using in multiple places.
I don't think I need to convince anybody in this room that algorithms are useful for content recommendation. Part of the reason algorithms have become such an important area of development is that they are increasingly a priority for the business.
We're publishing around 250 pieces of original journalism per day, and not all of it is breaking news. A lot of it is deeper coverage, like long-form articles. The problem is that we're dealing with very limited space to promote this content. On the New York Times homepage you see fewer than a hundred articles at a time, and some content at the bottom is effectively hidden. That's one reason we need to optimize that space by getting the right content to the right audience.
It’s a lot to ask to have humans do this curation all the time. We could ask our editors to go through every single part of the homepage and swap out the content multiple times a day as new articles get published, but editors understandably would like to focus more on the areas of the paper that are more creative. Editorial judgement is reserved mainly for top stories and breaking news and we don’t intend to replace that, but we also have to be respectful of editors’ time.
Moving on to the main goals of algorithmic recommendations, we obviously want to make sure we use the space on our website efficiently to drive engagement. However, algorithmic recommendations can also help us free up resources, as I mentioned. Another reason, which I feel really passionate about, is elevating content that is not necessarily breaking news. We refer to it as evergreen content: articles from years past that might still be relevant to people today. Content is evergreen to different degrees; some might only be relevant at certain times of the year, or even just for a couple of weeks. Right now, we're basically burying all that content and constantly replacing it with whatever is newest. If you're not coming to us as a news junkie who refreshes all the time and tries to catch every article, you will miss out on a lot of content that might be relevant to you. Algorithms are a great way to resurface that valuable content and make sure it gets seen. This aligns with being more invested in people's interests and in what we're recommending to them.
So where do we currently use algorithms? To give you a little bit of historical background, we've been going through a lot of transformation around this. Until 2018, we just had a “Recommended For You” module on our website, for which anything we had published in (I think) the last seven days was eligible to be recommended. Then last year, we had a paradigm shift: rather than having one big bucket of eligible content, we would have our editors pre-curate pools of content that would be eligible to be shown in particular parts of the homepage.
If you're an avid New York Times user, especially on the apps, you’ll be interested to know that we're currently working on some new features that will allow you to customize what you would like to be reading. We present our readers with a newsfeed-style article feed where you can customize your own interests, and we're going to add some algorithmic work on top of that as well. It’s currently only on iOS, but it will soon be on Android, and it's going to be made quite prominent in a new version of the app.
We are often asked about the particular challenges of recommending articles that you won't encounter when recommending movies or music. First of all, our catalog changes all the time, so we have what's often called a “cold start problem” at the level of the article. When a new article gets published, we don't know how engaging it will be, and we have the same problem for a lot of our users. We have many users who aren't subscribed or registered with the Times, and we have little or no history about these people. We can't be highly personalized for everyone because we don't know where they're coming from, and we're not using any third-party data to do recommendations. If they don't have a reading history recorded with us, we can't really do anything for them.
If you're thinking about this as a machine learning problem, the reward signal, which in this case is some measure of engagement such as a click, is non-stationary. Articles often peak in terms of interest and then decay very quickly. But there are some articles that stay relevant for longer periods of time, and so you want a model that is able to capture both of those patterns. You don't want to keep showing only new content, but you also don’t want to show outdated things unnecessarily.
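One common way to handle a non-stationary reward like this is to discount old observations so the estimate can track a decaying click-through rate. The talk doesn't give the Times' exact mechanism (a forgetting rate is mentioned later), so this is only an illustrative sketch:

```python
# Illustrative sketch (not the NYT's actual code): tracking a
# non-stationary click-through rate with exponential discounting,
# so older observations count less over time.
class DiscountedCTR:
    def __init__(self, gamma=0.95):
        self.gamma = gamma        # forgetting factor applied per update interval
        self.clicks = 0.0
        self.impressions = 0.0

    def update(self, clicks, impressions):
        # Decay the running totals, then add the newest batch of data.
        self.clicks = self.gamma * self.clicks + clicks
        self.impressions = self.gamma * self.impressions + impressions

    def estimate(self):
        return self.clicks / self.impressions if self.impressions else 0.0

ctr = DiscountedCTR(gamma=0.9)
ctr.update(clicks=50, impressions=1000)   # an article peaks in interest...
ctr.update(clicks=5, impressions=1000)    # ...then interest decays
```

With `gamma` close to 1 the estimate changes slowly (good for evergreen content); with a smaller `gamma` it chases recent behavior (good for fast-decaying news).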
The modeling approach that we took uses algorithms from the reinforcement learning literature that are really good at balancing what's called exploration and exploitation. Exploration just means trying out new actions and learning what reward each one incurs. You learn over time, by testing the actions available to you, which ones work better than others, and you exploit those in the future. This approach is good for maximizing reward over time in a changing environment.
Anna Coenen, Senior Data Scientist at the New York Times, speaks at DSS NYC 2019
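The explore/exploit trade-off just described can be illustrated with the simplest possible bandit policy, epsilon-greedy (the production system uses a more sophisticated contextual bandit; this sketch only shows the trade-off itself):

```python
import random

# Epsilon-greedy sketch of the explore/exploit trade-off: with
# probability epsilon, try something new; otherwise play the
# article currently believed to be the most engaging.
def choose_article(ctr_estimates, epsilon=0.1, rng=random):
    """ctr_estimates: dict mapping article id -> estimated click-through rate."""
    if rng.random() < epsilon:
        # Explore: pick a random article to keep learning about the catalog.
        return rng.choice(list(ctr_estimates))
    # Exploit: play the article with the highest estimated CTR.
    return max(ctr_estimates, key=ctr_estimates.get)
```

Raising `epsilon` makes the policy more exploratory, analogous to the tunable exploration parameter mentioned later in the talk.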
For this talk, I'm just going to focus on click-through rates, which are the number of clicks over the number of impressions that an article gets. The goal is to optimize this reward given some information about the context, which could be information about the user. Underlying the bandit that we use is a model that associates a set of linear weights with every article; given a context, it predicts the average reward, i.e. the click-through rate, that the article will incur. We retrain the model at 15-minute intervals based on new data coming in from how people interact with it. That means every 15 minutes, the model revises its belief about how engaging every article is.
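To make the 15-minute belief update concrete, here is a minimal sketch assuming a Beta-Bernoulli model per article (the talk describes linear weights per article; a Beta posterior over each article's CTR is the simplest stand-in, and the class and method names are illustrative):

```python
from collections import defaultdict

# Sketch of the periodic belief update: every retraining interval,
# each article's posterior over its CTR is updated with the clicks
# and impressions observed since the last run.
class CTRBeliefs:
    def __init__(self):
        # Beta(1, 1) prior: no opinion yet about a brand-new article's CTR.
        self.params = defaultdict(lambda: [1.0, 1.0])

    def retrain(self, batch):
        """batch: iterable of (article_id, clicks, impressions) since the last run."""
        for article_id, clicks, impressions in batch:
            a, b = self.params[article_id]
            self.params[article_id] = [a + clicks, b + (impressions - clicks)]

    def mean_ctr(self, article_id):
        a, b = self.params[article_id]
        return a / (a + b)
```

Each call to `retrain` plays the role of one 15-minute model refresh: beliefs sharpen for articles with lots of traffic and stay wide for new ones.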
Interestingly enough, the model does this whole process in real time. When a user comes in and we have a bunch of articles eligible to be recommended to them, the bandit takes the posterior of its belief over the click-through rates of those articles and samples from it. What that does is bias recommendations toward articles it already believes are engaging, while still recommending new things it doesn’t know much about yet. There's a parameter in there that you can tune that determines how exploratory the algorithm is, in other words how likely it is to play something new and unknown versus maximizing and playing the thing that it knows everybody clicks on all the time. To deploy this bandit, we use Google Cloud Platform entirely. Training runs in Kubernetes, we pull data from BigQuery, and then we deploy the model as a container again.
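Sampling from the posterior to pick an article is Thompson sampling. A minimal sketch, again assuming Beta posteriors per article rather than the Times' linear model, with an `alpha` parameter that widens the posterior to make the policy more exploratory:

```python
import random

# Thompson sampling sketch: draw a plausible CTR for each article from
# its posterior and recommend the article with the best draw. Uncertain
# articles occasionally draw high values, so they still get explored.
def thompson_pick(posteriors, alpha=1.0, rng=random):
    """posteriors: dict article_id -> (a, b) Beta parameters.

    alpha > 1 flattens each posterior (same mean, more variance),
    making the bandit more exploratory -- a stand-in for the tunable
    exploration parameter described in the talk.
    """
    draws = {
        aid: rng.betavariate(a / alpha, b / alpha)
        for aid, (a, b) in posteriors.items()
    }
    return max(draws, key=draws.get)
```

An article with a tight posterior around a high CTR wins most draws, but a new article with a wide posterior still gets picked sometimes, which is exactly the explore/exploit behavior described above.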
I do want to take a little bit of time to talk about the bandits that we currently have, or have had, in production and the results we've seen. Even within a contextual bandit framework, you can fit a model that has no context at all; in that case, all it does is try to find the right articles to show without any personalization. We also have a model that uses a user's location, and one that uses users’ preferences across different desks or sections of The Times.
A non-contextual bandit might just estimate the click-through rate of an article based on a single intercept. You can essentially think of this as a linear model. For this particular model, all we had to tune was the alpha parameter that governs how exploratory the algorithm is. We tuned the forgetting rate a little bit by trial and error, but also by back-testing the algorithm: we took past data on how people engage with our site and replayed the algorithm under different settings of those parameters. Then we saw which of those parameter settings yielded the highest expected click-through rate.
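Back-testing a bandit on logged data is commonly done with the replay method: step through historical events and only score the ones where the candidate policy would have made the same choice the logging system made. A minimal sketch (the function and field names are illustrative, not the Times' pipeline):

```python
# Replay-style offline evaluation: estimate a policy's CTR from logs by
# scoring only the events where the policy agrees with what was shown.
def replay_ctr(log, policy):
    """log: list of (candidates, shown_article, clicked) tuples.
    policy: function mapping a candidate list to the article it would pick.
    """
    matched = 0
    clicks = 0
    for candidates, shown, clicked in log:
        if policy(candidates) == shown:
            matched += 1        # the policy would have shown this article too
            clicks += clicked   # so this logged outcome counts toward its CTR
    return clicks / matched if matched else 0.0
```

Running this over the same log for several parameter settings (alpha, forgetting rate) and keeping the setting with the highest replayed CTR mirrors the tuning procedure described above.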
We always test all of our models against some baseline. In this case, we compared against what editors are doing: the same story block, curated editorially in one condition and algorithmically in another, plus a random baseline to compare to. Remember that in these cases, the pools we're talking about were pre-curated by editors and didn’t represent all of our content, so the experience would have been somewhat tailored anyway. But I still think the lifts are pretty impressive across all of those surfaces. You can also see that the bandit is more likely than editors to give impressions to articles that are a little bit older. Part of the reason the bandit works well is that it's able to recycle older content that continues to be engaging and that we would otherwise not have recommended.
Now, in addition to just an intercept, we're also computing a coefficient for different geographical locations for every article. That means that for every article we publish, we're now not only determining how engaging it is in general but also how engaging it is in a particular region. We defined seven U.S. regions and six international regions from which the model was able to learn how engaging each article is. A good example of this was the midterm elections, during which some of the content was more relevant to people in different regions, and we had reason to believe so because some of the races were local. We're definitely getting a little more click-through overall when we use this model compared to a non-contextual bandit.
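The structure described here, a global intercept per article plus a per-region coefficient, can be sketched as follows (the weights and field names are illustrative; the real model learns them from click data):

```python
# Sketch of the geo-contextual score: each article carries a global
# intercept (how engaging it is overall) plus per-region coefficients
# (how much more or less engaging it is in a given region).
def geo_score(article_weights, region):
    """article_weights: {'intercept': w0, 'regions': {region_name: w}}."""
    w0 = article_weights["intercept"]
    w_region = article_weights["regions"].get(region, 0.0)  # 0 if region unseen
    return w0 + w_region
```

For a midterm-election story about a local race, the region coefficient for the affected region would be large and positive, so the article scores highly there while staying modest everywhere else.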
In addition, we have a measure that we call “effective catalog size”, which is basically a measure of how many different articles an algorithm or policy is effectively recommending. This “geo bandit” sits a little above the non-contextual bandit when we factor in how many different things it recommends. In addition to picking up the articles that are popular everywhere, this model can pick up articles that are popular in a particular region.
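One common way to compute a measure like this is the perplexity (exponentiated Shannon entropy) of the impression distribution over articles; the talk doesn't give the Times' exact formula, so this is only one plausible definition:

```python
import math

# "Effective catalog size" sketch: perplexity of the impression
# distribution. Uniform impressions over N articles give exactly N;
# concentrating impressions on a few articles gives a smaller number.
def effective_catalog_size(impressions):
    """impressions: dict article_id -> impression count."""
    total = sum(impressions.values())
    probs = [count / total for count in impressions.values() if count > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A policy that always shows the same few hits scores low on this measure; one that also surfaces regionally popular articles, as the geo bandit does, scores higher.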
To end, I'm going to talk about the last kind of bandit that we are using, which focuses on more behavioral personalization. By behavioral, I mean not just where a user is coming from but something based on their actual reading history with The Times. In this case, we added a new set of coefficients that capture the likelihood that a user will engage with an article, weighted by the proportion of that user's page views that went to a specific desk at The Times. This model was able to capture how likely a user is to engage with a particular article from a given desk, given how much they've read from that desk in the past. All of the correlations between past and future engagement are positive, which makes sense: your behavior in the past predicts your behavior in the future. It’s also interesting to look at the magnitude of the correlations, in other words, how much more engagement we got by adding certain desks.
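The desk-affinity feature described here can be sketched as the share of a user's past page views that went to an article's desk, combined with a per-desk weight (all names and weights below are illustrative, not the Times' actual feature pipeline):

```python
# Sketch of the behavioral context: a user's affinity for an article's
# desk, computed as the share of their past page views to that desk.
def desk_affinity(user_pageviews, desk):
    """user_pageviews: {desk_name: page_view_count} for one user."""
    total = sum(user_pageviews.values())
    return user_pageviews.get(desk, 0) / total if total else 0.0

def behavioral_score(intercept, desk_weight, user_pageviews, desk):
    # Article score = general engagingness + a learned desk coefficient
    # scaled by how much this user has read from that desk before.
    return intercept + desk_weight * desk_affinity(user_pageviews, desk)
```

A heavy Sports reader thus gets a boost on Sports articles proportional to both their reading history and how predictive that desk's coefficient turned out to be.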
We're currently in the process of taking on new projects to answer this question. We want to allow users to customize what they want to see, which will be prominent within the app because it’s the first time we are surfacing personalization this explicitly. We also have to carefully determine how behavior within the app will differ from the data that we have from the web. If we succeed, it will be very visible and hopefully a big success.