Bias and Interpretability in Machine Learning

by DataScienceSalon | Technology


Presented by Fatih Akici, Manager of Risk Analytics and Data Science at Populus Financial Group, at Data Science Salon Austin. You can watch the full talk here.

As intelligent systems deepen their footprint in our daily lives, algorithmic bias is becoming a more prominent problem. Using an applied example, Fatih argues that leaders should be proactive in identifying biases and outlines the benefits of fixing them.

Fatih Akici is a self-proclaimed PhD-dropout whose main goal was to put puppies in his presentation, and he was wildly successful there. Read on for a thorough demonstration of real-world bias in an algorithm we build together, which is less successful!

THE MODEL

We’re building an image-classification model that separates dogs from wolves. We will use our own image data instead of a pre-packaged, standardized, cleaned-up data set from a Python or R library. We will build an awesome-looking model that is going to fail miserably; then we will come back and use LIME to explain why the model looked so great and why it failed so badly. We’ll discuss the pragmatic and ethical aspects of the issue, but I think what really makes this presentation unique is that it deals with puppies.

Our data set consists of 45 dog images and 45 wolf images, 90 data points in total, which we are going to split into training and validation sets. We will use 67 percent of the data for training and the remaining 33 percent for validating the model, so we will train an image-classification model on just 60 images: 30 dogs and 30 wolves. We can’t expect a model trained on so little data to perform well.
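The split described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the code from the talk; the function name `stratified_split` and the choice to split each class separately (so the 30/30 balance is kept) are assumptions.

```python
import random

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_indices, validation_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)  # 45 * 0.67 rounds to 30
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return sorted(train), sorted(validation)

# 45 dog images followed by 45 wolf images, as in the article.
labels = ["dog"] * 45 + ["wolf"] * 45
train_idx, val_idx = stratified_split(labels, train_fraction=0.67)
print(len(train_idx), len(val_idx))  # 60 30
```

With 45 images per class, a 67 percent cut gives 30 training and 15 validation images for each animal.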

[Screenshot: model training code]

This is just a tiny snippet, the model-training piece, of the full workflow. We initialize the model and train it for just five epochs. The accuracy of each epoch, right up to the last one, comes in at close to 85 or 86 percent. With just 60 data points, this is surprising, right?

Evaluating the model with its built-in evaluation call shows that training accuracy is 85% and validation accuracy is 83%, which means I built an image-classification model on 60 images and it performed fantastically on the held-out data set as well. But we didn’t use cross-validation; there’s no optimization and no model selection. It’s just one run over the data to see how the model performs, and we’re done.

This is just an example exercise. It could be a healthcare model, a financial model, an admissions model, or a recruitment or HR model. Based on the data we had in hand, 90 records, we built it and we have incredible confidence in it. Now we have deployed it: we’re opening the gates and welcoming new cases to evaluate.

WHEN THE MODEL FAILS

Let’s try the real-world test with the images we held out for it: three puppies and three wolves. We process and standardize the data just as we did in the training phase, with a simple resize to 200 pixels, and put that in a loop so that the code shows us each image and its classification.

So the first one fails, but it’s a husky puppy. Maybe huskies look like wolves, so we still haven’t lost our faith in the model. The next case, the fluffy little puppy, is classified correctly. The next one, a cute flying puppy, is classified as a wolf. That’s unacceptable, and it breaks our hearts.

If we go to the three wolf images, we can see that they are all classified as dogs.

You know you have a real problem when your model confuses a little puppy with a vicious wolf!

How can the model be so successful in training and validation but fail so badly in the true test? It classified only one of the six samples correctly.
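To put a number on that failure, the six outcomes described above can be scored directly. This is a small worked example, not code from the talk; the list order follows the article (three puppies, then three wolves).

```python
# Ground truth for the six real-world images: three puppies, then three wolves.
true_labels = ["dog", "dog", "dog", "wolf", "wolf", "wolf"]
# What the model said, in the order described in the article.
predicted = ["wolf", "dog", "wolf", "dog", "dog", "dog"]

def accuracy(truth, guesses):
    """Share of samples where the guess matches the truth."""
    correct = sum(t == g for t, g in zip(truth, guesses))
    return correct / len(truth)

test_accuracy = accuracy(true_labels, predicted)
print(f"{test_accuracy:.0%}")  # 17%
```

One correct answer out of six is about 17 percent, well below coin-flipping and far from the 83 percent validation accuracy reported earlier.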

VARIABLE IMPORTANCE

This is a very high-level summary of variable importance. When we talk about variable importance, feature importance, or attribute relevance, we are usually talking about which features matter most to the model overall. Here, though, we care about the attributes that caused this particular failure. So, after deployment, how do we get the contribution of each attribute to a single decision of a machine learning model?

We have two approaches to this:

One, we open up the mathematical algorithm if it’s simple enough; this may be preferable. If we have a logistic regression, it’s just one linear equation, so for each feature we basically answer the question “how far am I from the average guy?” We already have the coefficients, so we can multiply each one by that distance and sort; the top variables are the most influential ones behind the model’s decision for a particular sample.
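The "multiply and sort" idea for a logistic regression can be sketched as follows. The feature names, coefficients, and means are made up for illustration; each contribution is the coefficient times the sample's distance from the feature mean, measured on the log-odds scale.

```python
def rank_contributions(coefficients, feature_means, sample):
    """Rank features by |coefficient * (value - mean)| for one sample."""
    contributions = {
        name: coefficients[name] * (sample[name] - feature_means[name])
        for name in coefficients
    }
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical fitted model and one applicant; none of these numbers are from the talk.
coefficients = {"income": 0.8, "debt_ratio": -1.5, "account_age": 0.3}
feature_means = {"income": 0.0, "debt_ratio": 0.4, "account_age": 5.0}
sample = {"income": 0.5, "debt_ratio": 0.9, "account_age": 4.0}

ranked = rank_contributions(coefficients, feature_means, sample)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

In this toy case `debt_ratio` comes out on top, because this applicant sits far from the average on a feature with a large coefficient.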

Or, take a random forest: it is basically a bunch of trees, each of which branches out from the root node. At each split, a condition on some feature decides which branch to follow, and we finally arrive at a leaf, which is the classification. So we can express the path from root to leaf as a mathematical one-liner.
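As a rough illustration of that one-liner, the sketch below walks a single hand-written tree and records the conditions it passes through. The tree, its feature names, and its thresholds are hypothetical; a real forest would produce one such rule per tree and combine the votes.

```python
# A tiny hand-written decision tree; features and thresholds are invented.
tree = {
    "feature": "background_greenness", "threshold": 0.5,
    "left": {"leaf": "wolf"},
    "right": {
        "feature": "snout_length", "threshold": 0.7,
        "left": {"leaf": "dog"},
        "right": {"leaf": "wolf"},
    },
}

def explain(node, sample):
    """Follow the sample from the root to a leaf and return the path as one rule."""
    conditions = []
    while "leaf" not in node:
        name, threshold = node["feature"], node["threshold"]
        if sample[name] <= threshold:
            conditions.append(f"{name} <= {threshold}")
            node = node["left"]
        else:
            conditions.append(f"{name} > {threshold}")
            node = node["right"]
    return " AND ".join(conditions) + " => " + node["leaf"]

sample = {"background_greenness": 0.9, "snout_length": 0.4}
rule = explain(tree, sample)
print(rule)  # background_greenness > 0.5 AND snout_length <= 0.7 => dog
```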

Or you can just say, “I don’t really care about the mathematical details, I’ll just go model-agnostic.” The two main approaches to model-agnostic explanations are LIME and SHAP. Loosely speaking, LIME can be seen as a special case of the framework behind SHAP. We’ll use LIME in this example.

LIME works by taking the sample that you want to explain: hey model, you just told me that this image is a wolf, how did you come to that conclusion? LIME takes the sample and perturbs it, generating many variations of that image so that each variation can be scored by the model and we can fit a linear model to the decision boundary around the sample. Even though the overall decision boundary can be highly complex, LIME explains things locally, in the vicinity of the sample you give it.

So we perturb the dog image and the wolf image and run the variations through an algorithm that fits weighted linear models, and that’s where the magic happens: LIME shows why the model classified each image the way it did. In our example, LIME reveals that the model is classifying dogs versus wolves based on attributes that have nothing to do with being a wolf or a dog!
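The perturb, score, weight, and fit loop can be sketched without any libraries. This is a simplified illustration of the idea, not the `lime` package and not the code from the talk: the image is reduced to four on/off segments, `black_box` is a hypothetical stand-in for the trained classifier, and the exponential proximity kernel and its width are assumptions.

```python
import math
import random

def black_box(mask):
    """Hypothetical classifier returning P(wolf). mask[i] == 1 means segment i is visible.
    Segment 0 is the snowy background; segments 1-3 cover the animal."""
    score = -1.0 + 3.0 * mask[0] + 0.2 * mask[1] + 0.1 * mask[2] + 0.0 * mask[3]
    return 1.0 / (1.0 + math.exp(-score))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def lime_sketch(predict, n_segments, n_samples=500, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model around the fully visible image."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(n_segments)]  # random perturbation
        distance = 1.0 - sum(mask) / n_segments                # share of segments hidden
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
        rows.append([1.0] + mask)                              # leading 1.0 is the intercept
        targets.append(predict(mask))
    k = n_segments + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets)) for i in range(k)]
    return solve(xtwx, xtwy)[1:]  # drop the intercept, keep one weight per segment

importance = lime_sketch(black_box, n_segments=4)
for segment, weight in enumerate(importance):
    print(f"segment {segment}: {weight:+.3f}")
```

In this toy setup the background segment receives by far the largest weight, which mirrors what the explanation showed for the dog and wolf images.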

MODEL BIAS

So this is a demonstration of how bias in the training data gets reflected in the results. Our prejudices, our own perceptions, and our past mistakes in labeling data are going to be reflected in the results of the model, regardless of how complicated that model is.

If we come back to the training sample, we can take another look: all dogs had green backgrounds. They were all in a backyard, a playground, or a nice forest. All the wolves were on white backgrounds: snowy, harsher, with basically no sign of life.
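A check like this can be automated if the backgrounds are tagged. The sketch below assumes a hypothetical hand-made annotation of each training image as "green" or "white" (the talk found this by looking at the images, not by running code) and measures how well the background alone predicts the label.

```python
from collections import Counter

# Hypothetical annotations for the 60 training images: (label, background).
records = [("dog", "green")] * 30 + [("wolf", "white")] * 30

def background_purity(records):
    """For each background, the share of images carrying its most common label."""
    by_background = {}
    for label, background in records:
        by_background.setdefault(background, Counter())[label] += 1
    return {
        background: max(label_counts.values()) / sum(label_counts.values())
        for background, label_counts in by_background.items()
    }

purity = background_purity(records)
print(purity)  # {'green': 1.0, 'white': 1.0}
```

A purity of 1.0 for every background means the background alone separates the classes perfectly, so a model can score well without ever looking at the animal.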

So the main ethical or philosophical question is: in which cases in our daily lives are we subjecting different groups to a nice background versus a harsh one? A green background versus a white, cold one.

EVERYDAY BIAS IN AI EXAMPLES

Let’s look at everyday examples of bias in AI. The first example is a recruiting tool that Amazon used and then discovered was biased against women. Why? Because women always had the snowy background; they barely existed in the data set. The model didn’t have an attribute of “this is a female” versus “this is a male,” but there are still certain characteristics that the algorithm learned. Maybe it learned female names or female hobbies, and in the end it was going to continue to recommend hiring more male technical experts.

The second example is that AI is sending people to jail and getting it wrong! In our courthouses today, judges are using AI model recommendations in bail decisions and other decisions that can change people’s lives. Think of introducing just one dumb variable into your data set, your credit file or legal file, and it could permanently change the trajectory of your life. If one ethnic group is more associated with crime in the data, the AI does the same thing that Amazon’s hiring algorithm did: it continues to recommend what happened in the past. It keeps labeling the groups that had snowy, harsh, cold backgrounds as wolves, vicious creatures, while the dogs coming from green backgrounds, with nice spring weather and nice temperatures, are labeled as okay to approve for bail. The example is pretty striking.

The third example is also pretty striking: a scientist from MIT, who is African-American, realized that facial recognition applications did not detect her face; her skin color was totally invisible to the algorithm. She saw this when she put on a white mask and the tool said, okay, there’s a human here. The bias in the data is that the algorithm didn’t see enough African-American data points during training.

PRAGMATIC AND ETHICAL RAMIFICATIONS

So this basically lets us face our prejudices: we’re fueling the model with our own prejudices and past mistakes. It may be statistically and mechanically true that when you just look at a given data set, certain ethnic groups may be more associated with crime. That may be statistically true.

Or, if you go out on the street and take a random hundred men and hundred women, you may come to the conclusion that men are usually better at math and women are better at more artsy things, and this may be statistically true. BUT the next woman who steps into our machine learning model, which we deployed after training on that imbalanced or unfair data set, is going to be labeled as less capable at a technical job than a random man. So there’s a very thin line between what is statistically and mechanically true and what is actually, philosophically, or ethically true.

We have to be aware of what prejudices we’re fueling our models with, regardless of how complicated that model is.

When confounding factors are at play, correlations appear as causation.

If one minority is more associated with crime, you may think there is something inherent in that group, and if I am a judge, I wouldn’t really be wrong in labeling the next member of that group as a criminal. I would be less wrong statistically: that is true; but is it ethically true? All we’re doing is reinforcing our own past mistakes that existed in the training data and got reflected in the result.


Post Category: Technology

Tags: Ethics | Machine Learning
