DSS Blog

Solving Explainable AI Challenges For The FICO Score

Written by Gerald Fahner, Senior Principal Scientist at FICO | Jul 1, 2019 3:28:32 PM

In a world where even the most basic purchases and decisions depend on credit scores, accuracy holds significant weight. Gerald Fahner, Senior Principal Scientist at FICO, explains not only how his team uses machine learning to derive individuals’ scores across populations, but also how such models can ensure that the right consequences and rewards reach the right people. With many different model-building tactics to choose from, FICO has performed test after test to find the optimal set of constraints that best serve customers while keeping results accurate and explainable.

I work for FICO in Austin, and the speaker before me just mentioned the importance of explainability in AI. Obviously, this is more and more of a concern now that we are building so many black boxes, learning biases from the data, and potentially suffering garbage-in, garbage-out situations.

I sometimes say the FICO score is probably the most scrutinized model in the world, so when we try to apply the latest and greatest machine learning and AI techniques we need to be especially careful that everything remains explainable. FICO built the first credit scoring model back in the 60s, using earlier versions of discriminant analysis that existed before logistic regression came along. 

Chances are that whenever you swipe a credit card and whenever you apply for a car loan, a mortgage or any other loan product, you get scored by FICO, often several times. A large number of decisions are made every year in the US alone, about 10 million per year, and FICO scores are now available in more than 25 countries. Obviously, each of these decisions can be life-changing for the consumer. If you get declined for a mortgage, how bad is that for you? Moreover, if you have a low FICO score and you get a car loan, you're probably going to pay thousands of dollars more in interest than somebody with a higher FICO score. So it's very important to keep these models transparent and explainable.

Credit scoring is heavily regulated, so there are a number of laws against disparate impact. When people are declined for credit, we need to be able to explain to them the reasons why they were declined. FICO scores are a central part of the lifecycle of financial lending, which includes account management and collections.

So let me just give a brief, high-level explanation of what the FICO score is: it's a number between 300 and 850, and the higher the score, the better. The score is based on the information that exists about consumers at the credit bureau agencies, like Experian, TransUnion and Equifax, the big ones. The score is designed to rank order risk, but it is not designed to give you accurate forecasts of the probability of delinquency, because that also depends a lot on the economy and its cycle. Calibration for the particular products sold and their design is left to the portfolio managers, who must keep track of changes in consumer behavior and the economic cycle at large. The FICO score itself is designed to be a very robust rank ordering of risk, so we cannot say for certain who will default. However, we can say, “People who look like Gerald have odds of 20 to 1 to pay back as expected.”

Let's say that given a particular FICO score, maybe one out of four people goes delinquent, and at another score, it's one out of seven. Typically, the FICO score is scaled so that 20 points double the odds.
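
To make that scaling concrete, here is a minimal sketch of points-to-double-the-odds (PDO) scaling. The anchor values below (a score of 660 at 20:1 good/bad odds) are illustrative assumptions, not FICO's published calibration.

```python
# A minimal sketch of "points to double the odds" (PDO) scaling.
import math

PDO = 20            # points that double the good/bad odds
ANCHOR_SCORE = 660  # assumed score at the anchor odds (illustrative)
ANCHOR_ODDS = 20    # assumed 20:1 good/bad odds at the anchor score

factor = PDO / math.log(2)
offset = ANCHOR_SCORE - factor * math.log(ANCHOR_ODDS)

def score_from_odds(odds):
    """Map good/bad odds to a score on the PDO scale."""
    return offset + factor * math.log(odds)

for odds in (10, 20, 40, 80):
    print(f"odds {odds:>3}:1 -> score {score_from_odds(odds):.0f}")
# Each doubling of the odds adds exactly PDO = 20 points.
```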

Statistical models are often linear, exponential or logarithmic in form. FICO scores work differently, because they are based on a technology called scorecards. To build a scorecard, you typically have a few predictive variables that are already categorical. For example, do you own a house or do you rent? Then there is a very large number of variables derived from the credit bureau. This could be information from the credit records about your payment history. For example, how many times in the last year or two have you had a late payment? Or did you have a delinquency in the last year? Perhaps more than one?

You can see that each of these variables is what we call a characteristic. Their ordinal ranges are binned into attribute ranges, and then each attribute gets a number of score points. Here, you see that if you had some historic delinquency, you get more points the longer ago it happened. If it happened more recently, that's a bad sign, so you get fewer points. Obviously, if you never had a serious delinquency, you get the maximum points. This is what you would expect to see from the trained model, but it turns out every so often that there are certain things that don't quite make sense, or at least are not fully intuitive. For example, the higher the percentage of utilization on your credit card, the worse off you are. People who are close to being maxed out tend to be worse risks.
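
As a rough illustration of that structure, here is a made-up scorecard fragment for a single characteristic. The bin boundaries and point values are invented for the example and are not actual FICO score weights.

```python
# Illustrative (not real) points for one characteristic:
# "months since most recent serious delinquency", binned into attributes.
MONTHS_SINCE_DELINQUENCY_POINTS = [
    ((0, 11),   15),   # delinquency within the last year: fewest points
    ((12, 23),  30),
    ((24, 47),  45),
    ((48, None), 55),  # long ago: more points
]
NO_DELINQUENCY_POINTS = 75     # never seriously delinquent: maximum points

def points_for_delinquency(months_since):
    """Look up scorecard points for the 'months since last delinquency' characteristic."""
    if months_since is None:                      # no serious delinquency on file
        return NO_DELINQUENCY_POINTS
    for (lo, hi), pts in MONTHS_SINCE_DELINQUENCY_POINTS:
        if (hi is None and months_since >= lo) or (hi is not None and lo <= months_since <= hi):
            return pts
    raise ValueError("months_since out of range")

print(points_for_delinquency(None))   # 75
print(points_for_delinquency(6))      # 15
print(points_for_delinquency(30))     # 45
```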

Typically, people would expect a monotonic relationship across the ordinal range, but sometimes you see something weird. Maybe people at forty percent utilization are suddenly a little bit better off than people at thirty percent. In that case, a certain amount of engineering is needed to assign these score points. They are empirically derived, but subject to constraints on explainability, so we use certain types of nonlinear programming to optimize those score weights subject to constraints. There are literally hundreds of these characteristics, and they are all domain-expert-derived. Each of these “features”, as we would call them in machine learning, is very explainable. The assignment of points to the features is very explainable, and then there is a characteristic selection algorithm that parses through a library of 500 variables and may select 10 or 15 of them as characteristics for the scorecard. There is a whole science and art to binning the right characteristics. If you make the attributes too narrow, you get a lot of noise. If you make them too wide, you're approximating a smooth function with a very coarse stair-step function and you will lose some signal there.
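
Here is a minimal sketch of the idea of fitting score weights under an explainability constraint: a toy one-characteristic scorecard whose bin weights are estimated by constrained optimization so they stay monotone across the ordinal bins. The data are synthetic and the setup is far simpler than FICO's actual nonlinear programming formulation.

```python
# Toy constrained fit of per-bin score weights (synthetic data, illustrative only).
import numpy as np
from scipy.optimize import minimize, LinearConstraint

rng = np.random.default_rng(0)

# One characteristic binned into 4 ordinal attributes (0 = worst ... 3 = best),
# plus a binary "bad" flag. Note the slight reversal between the last two true
# weights, like the 40% vs 30% utilization quirk described above.
n = 5000
bins = rng.integers(0, 4, size=n)
true_w = np.array([-1.0, -0.4, 0.3, 0.2])        # true log-odds of "good" per bin
p_bad = 1.0 / (1.0 + np.exp(true_w[bins]))       # higher weight -> lower bad rate
bad = rng.binomial(1, p_bad)
X = np.eye(4)[bins]                              # one-hot encoding of the bins

def neg_log_likelihood(w):
    z = X @ w                                    # score on the log-odds scale
    return np.mean((1 - bad) * np.log1p(np.exp(-z)) + bad * np.log1p(np.exp(z)))

# Explainability constraint: weights must not decrease across the ordinal bins.
A = np.array([[-1, 1, 0, 0],
              [0, -1, 1, 0],
              [0, 0, -1, 1]], dtype=float)
monotone = LinearConstraint(A, lb=0.0, ub=np.inf)

result = minimize(neg_log_likelihood, x0=np.zeros(4),
                  method="trust-constr", constraints=[monotone])
print("fitted bin weights (forced monotone):", np.round(result.x, 2))
```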

There’s definitely a good old bias-variance trade-off in constructing these models well, but what I want you to remember is that this simple model is easy to explain. There is a certain amount of domain knowledge that goes into pulling the most signal out of the data. We combine learning from data with domain expertise and business constraints on explainability when we construct these models.

 

See talks like this in person at our next Data Science Salon: Applying AI and Machine Learning to Finance, Healthcare and Hospitality, in Miami.

 

But how do our models consistently capture interactions in the right way? To compute a score, the model adds up the points for the attributes into which a consumer falls. Let's say this person had a delinquency 30 months ago; from that he gets fifty-five points. His overall utilization is thirty percent, and from that he gets forty-five points. When added up, the sum of the points becomes the FICO score. It's not quite that simple, because it turns out the population is very heterogeneous and you can do better if you have several scorecards for different sub-segments of the population. Let's say we start with a population of a hundred fifty million people and split it into those who had historic delinquencies and those who were always very clean. Underneath, there may be another split, so you end up with a binary tree structure. If you partition the full population into mutually exclusive and exhaustive segments, and then score each segment with its own scorecard, you have a fairly sophisticated system, because each segment can use different subsets of characteristics with different weights. You can show this mathematically: when you write down the formula for the FICO score as a segmented scorecard, it captures interactions between all the variables that are used to define the splits in the tree.
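
As a toy sketch of what a segmented scorecard looks like, here is a two-segment version with invented splits, characteristics and points; the real system has many more segments and characteristics.

```python
# A minimal sketch of a segmented scorecard: a small binary tree whose leaves
# each hold their own scorecard. All segment definitions and points are made up.
def score_clean_file(consumer):
    # Scorecard for consumers with no historic delinquency (illustrative points).
    pts = 75
    pts += 60 if consumer["utilization"] < 0.3 else 40
    return pts

def score_prior_delinquency(consumer):
    # Scorecard for consumers with a past delinquency; note it can use
    # different characteristics and different point assignments.
    pts = 45 if consumer["months_since_delinquency"] >= 24 else 25
    pts += 50 if consumer["utilization"] < 0.3 else 30
    return pts

def segmented_score(consumer):
    """Route the consumer down the segmentation tree, then apply the leaf scorecard."""
    if consumer["months_since_delinquency"] is None:
        return score_clean_file(consumer)
    return score_prior_delinquency(consumer)

print(segmented_score({"months_since_delinquency": 30, "utilization": 0.2}))
print(segmented_score({"months_since_delinquency": None, "utilization": 0.2}))
```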

As you can imagine, the development of this system is very tedious, because there are so many scorecards involved. A lot of score engineering goes into figuring out what a good segmentation is. There’s a kind of exponential search space, and it differs from country to country. We just did a project for the UK, and they have a different segmentation from the US. When countries differ in consumer behavior, their data sources also look very different, and the segmentations themselves can be totally different. The segmentation analysis alone demands using more AI and machine learning to get a higher degree of productivity in the construction of these models. That’s one of the reasons we are very interested in machine learning, and obviously in benchmarking as well.

You never quite know, when you construct these models, whether or not you are leaving predictive information on the table. If you could move the needle to predict a little bit better, that could make a big difference. Characteristic selection is an automated process that is part of machine learning. We always balance this machine learning with domain expertise to ensure transparency, compliance with regulation, and also to mitigate biases. Credit scoring data can be full of biases because of selection bias: you can only observe the performance of people who were granted a loan in the past, while people who were rejected drop out of the data. So you need to be very careful about that as well.

Nevertheless, the question is always: how does this FICO score model compare with the latest and greatest machine learning algorithms? The benchmark here is a typical ensemble-type model that combines the predictions from many classification and regression trees. Each tree is a good old binary tree trained by recursive partitioning to optimize something like the separation between good and bad cases. They are all eventually combined into a prediction function called the score, and you can think of the score as just a regression approximation of the input data set.

My short definition of machine learning is basically that it takes data and turns it into an algorithm. The algorithm is how you compute a score from the values of the predictors. It is an immensely complex model, and the big difference between these models and scorecards is that these models have tens if not hundreds of thousands of parameters that have no meaning to anyone. It’s not like regression coefficients, where you can recognize a slope or a sensitivity or an elasticity. The parameters in those trees have no meaning. We're using gradient boosting, a sequential method that trains each tree on the residuals of the previous trees. We need to make sure we are not overfitting, but the end model is still something of a black box. There are many good algorithms that can diagnose these black boxes by simulating the input-output relationship to get a sense of how the features matter. For example, if you change the value of an input, how will the score change? When you plot these curves, called partial dependence functions, you may get meaningful results. But as I said earlier, we also need to be able to go into the model, put in some constraints and make some tweaks to make sure it is fully explainable.
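
Here is a minimal sketch, on synthetic data with made-up feature names, of the kind of benchmark model described above: a gradient boosted tree ensemble plus a partial dependence check of how the predicted score responds to one input.

```python
# Toy gradient boosting benchmark on synthetic credit-like data (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
n = 20_000
X = np.column_stack([
    rng.uniform(0, 1, n),          # credit card utilization (0..1)
    rng.integers(0, 120, n),       # months since last delinquency
    rng.integers(1, 40, n),        # number of open accounts
])
# Synthetic "bad" flag: risk rises with utilization, falls with time since delinquency.
logit_bad = -2.0 + 3.0 * X[:, 0] - 0.02 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_bad)))

model = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.05)
model.fit(X, y)

# Diagnose the black box by simulating the input-output relationship:
# average model response over a grid of utilization values (feature 0).
pd_result = partial_dependence(model, X, features=[0], grid_resolution=8)
print("partial dependence of the model on utilization:")
print(np.round(pd_result["average"][0], 3))
```

In this sketch the partial dependence curve should rise with utilization; the point in the talk is that such curves diagnose the model but do not by themselves constrain it.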

We have people from our fraud team who have a lot of experience with deep learning and neural nets. At first, our work may have been a little frustrating and underwhelming for some, because we had hoped to find a bigger improvement. But I think the reason we haven’t is that the FICO score is already a very sophisticated model, capable of capturing nonlinearities and interactions. It has been developed by generations of analysts and scientists over the last twenty years or so.

It’s easy to show that machine learning has great lift over very simplistic models, and that's a marketing tactic. But when you have a very good model built on something like segmented logistic regression, it's much harder to get a lot of lift from machine learning. However, the really good news with this latest and greatest machine learning is that it breeds productivity. In one week we are basically able to squeeze as much, if not a little bit more, predictive power out of the data than with a full-blown segmented score development, which would take many people several months. We build these scoring models a lot in research, and we are constantly building FICO scores in the lab, so for us this machine learning is a great productivity boost: it lets us say much more quickly whether there may be value in a new data source, or whether it's time to build a new FICO score.

We see improvements already, so this has been a big success. But what I will show you next is how we blend the technologies, combining this time-proven, transparent scorecard technology with the latest machine learning to create something much more acceptable in our market: a credit score that works for regulators, for users, for stakeholders, for other analysts and even risk managers. This is the way we have incorporated newer machine learning into our FICO score research and development.

It's basically an axiom in the credit industry that if you pay down card balances, you become a better risk, and your score should only go up, if anything. So we ran a probe on the machine learning model to see, under this simulated scenario, what happens to the score. We ran this on a large sample of approximately ten million random US consumers, and none of them received a score drop as a result of paying down 90% of their credit card balances when we used the FICO 9 model, which is our latest. But with the stochastic gradient boosting model, 9 percent of consumers experienced a drop in the score. It was a very slight drop, but nevertheless it's something that would certainly confuse consumers, bankers and regulators. It’s not only about accuracy; your model also needs to pass the common-sense test. Perhaps the stochastic gradient boosting model is a bit more accurate, but it doesn't pass the common-sense test, so it won't be a good model for the FICO score. We want to introduce this blended idea of how to bring together different technologies. We can benefit a lot from parallel computing to diagnose the model. We can compare baseline additive models with models that capture low- or higher-order interactions, and find variables that interact strongly with many other variables. These tend to be good candidates for segmentation, so we gain a lot of information from them.
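
Here is a minimal sketch of that sanity probe, written against the toy gradient boosting model and data from the earlier sketch. The column layout and the 90% pay-down factor follow the scenario described above; everything else is an assumption for illustration, not the actual FICO test harness.

```python
# Toy "pay down balances" probe against the earlier gradient boosting sketch.
import numpy as np

def paydown_probe(model, X, balance_cols, paydown=0.9):
    """Share of consumers whose score proxy drops after paying down balances."""
    X_probe = X.copy()
    for col in balance_cols:
        X_probe[:, col] *= (1.0 - paydown)        # pay down 90% of card balances
    before = -model.decision_function(X)          # score proxy: higher = lower risk
    after = -model.decision_function(X_probe)
    return float(np.mean(after < before))         # fraction whose score got worse

# Using the toy model and X from the earlier sketch, where column 0 is utilization:
share = paydown_probe(model, X, balance_cols=[0])
print(f"{share:.1%} of simulated consumers would see their score drop")
```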

The key is to score the development data set with the best machine learning model and attach that machine learning score to the development data. Initially, the target is simply binary, but now we add another column to the data set that is basically a distillation of everything we know about a given consumer’s degree of badness. It’s a number between, let's say, zero and one: a probability of badness. Instead of deciding segmentations manually, we run a recursive algorithm similar to the CART classification and regression tree algorithm, but with scorecards built into the segments, starting with a single scorecard at the root node for the full population. We no longer try to predict the binary target, like in a logistic regression, but instead try to predict the machine learning score in the least-squares sense. This filters out a lot of noise, so it allows the recursive tree algorithm to focus less on the noise and more on the signal that the machine learning model has found.
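
As a minimal sketch of this distillation step, continuing with the toy model and data from the gradient boosting sketch: attach the machine learning score as a new column and fit a shallow CART-style regression tree to it in the least-squares sense. The tree here stands in for the segmented scorecard search, without the palatability constraints FICO adds.

```python
# Toy distillation: fit a shallow tree to the ML score instead of the binary target.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Distillation column: the machine learning model's score for every record
# in the development data (toy model and X from the earlier sketch).
ml_score = model.decision_function(X)

# Recursive least-squares segmentation on the ML score, not the binary target.
segmenter = DecisionTreeRegressor(max_depth=2, min_samples_leaf=2000)
segmenter.fit(X, ml_score)

# Each leaf of the shallow tree is a candidate segment; in the real system a
# separate, constraint-compliant scorecard is then trained inside each segment.
leaf_ids = segmenter.apply(X)
print("candidate segments and sizes:", np.unique(leaf_ids, return_counts=True))
```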

Chances are the prediction won't be that great, because a single scorecard cannot capture the same number of interactions that the machine learning model captures. The algorithm tries all kinds of splits, using all the variables we have in our characteristics libraries to define possible splits into child branches. We can then compare these splits to see which one is best and which one gives substantially better performance than the parent model. We have millions of records to train on, so you can usually do better than a single scorecard. We can also use the same algorithm to find the optimal split and then see whether the optimal split is better than the parent. We do this again and again until there is no more significant performance improvement. The scorecards that are part of these segmented scorecard trees are also trained in compliance with the palatability constraints on which we insist. It's not strictly maximum likelihood estimation, where I just estimate the score weights to get the best possible approximation of the target; it's always subject to these palatability constraints, so that the end result is the best possible segmented scorecard that remains explainable.
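
Here is a minimal sketch of that recursive split search, again against the toy arrays from the earlier sketches. A per-segment mean of the machine learning score stands in for the constrained scorecard fit, and the median split, segment-size floor and gain threshold are all arbitrary illustrative choices.

```python
# Toy recursive segment search against the ML score (uses X and ml_score from above).
import numpy as np

def sse(values):
    """Sum of squared errors around the segment mean (stand-in for a scorecard fit)."""
    return float(np.sum((values - values.mean()) ** 2))

def grow_segments(X, ml_score, idx, depth=0, max_depth=3, min_gain=500.0):
    """Recursively split a segment while the fit to the ML score keeps improving."""
    parent_sse = sse(ml_score[idx])
    best = None
    for col in range(X.shape[1]):                      # candidate split variables
        threshold = np.median(X[idx, col])
        left = idx[X[idx, col] <= threshold]
        right = idx[X[idx, col] > threshold]
        if len(left) < 1000 or len(right) < 1000:      # keep segments large enough
            continue
        gain = parent_sse - sse(ml_score[left]) - sse(ml_score[right])
        if best is None or gain > best[0]:
            best = (gain, col, threshold, left, right)
    if best is None or best[0] < min_gain or depth >= max_depth:
        print("  " * depth + f"leaf segment with {len(idx)} consumers")
        return
    gain, col, threshold, left, right = best
    print("  " * depth + f"split on variable {col} at {threshold:.2f} (gain {gain:.0f})")
    grow_segments(X, ml_score, left, depth + 1, max_depth, min_gain)
    grow_segments(X, ml_score, right, depth + 1, max_depth, min_gain)

grow_segments(X, ml_score, np.arange(len(ml_score)))
```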

In this way, I think we make the most of the data without running the risk of delivering something that has parts that are difficult to explain. The goal is to marry the best of both worlds.

 

Don’t miss the next Data Science Salon in Miami, September 10-11, 2019.